Preface


Conventional wisdom is that the gender-gap or mean difference in math test 
scores favoring boys is negligible, vanishing, and on a path toward closure given 
appropriate societal adjustments. Furthermore, stereotypes and bias are the most 
frequently cited reasons for boys' slightly larger math mean test scores. In fact, the
mean gap favoring boys in math has existed since the earliest onset of achievement 
testing in the USA. It was evident in Stone's 1908 doctoral dissertation data and 
in New York City's massive math testing program in schools starting around 1910. 
The gap exists worldwide in most developed countries on the PISA test, as well 
as on the "Nation's Report Card" or NAEP tests in the USA. Of far larger size,
but of nearly negligible interest, is the mean gap dramatically favoring girls in 
reading. In fact, girls are so dominant in reading it is difficult to find any studies 
reporting boys have larger test score means, at least in developed countries. That 
reading has favored girls in test score mean has been known at least since Gray's 
dissertation on reading tests in 1917. The convolution or finite mixture model in 
this book explains test score sex differences for both tasks. Interestingly, the mean 
differences, for both reading and math, are the smallest of test score sex differences. 
The test score variance differences, with larger variance for boys on both tasks, 
simply dwarf the mean differences for both tasks. The focus for more than 40 years 
has been on the smaller test score sex differences, the mean differences—arguably 
the wrong differences on which to focus. The theory, which is marginally more 
complicated than the probability model for the flip of two coins, explains these mean 
and variance test score differences for both tasks. The theory does so by modeling a 
biological mechanism which has been for decades out of favor, is nearly universally 
ignored, and is often disparaged. The monograph's spirit is well captured by the title 
of mathematician Keith Devlin's book, The Math Gene. 


University Park, PA, USA Hoben Thomas 
Chapter 1
A Focus on Math and Reading Test Score Inequalities


Worldwide there has been recognition that boys’ and girls’ reading and math test 
score distributions are different. What has not been recognized is precisely how 
these distributions differ. 

Large-scale achievement testing in the U.S. was started around 1897 by J. M. Rice
[1]. By 1902, the results of Rice’s large-scale U.S. elementary school arithmetic 
reasoning tests for children were available [2]. By 1908, following C. W. Stone’s 
efforts [3], it was clear that sixth grade boys and girls have different achievement 
test score distributions on each of two math tests Stone constructed. Stone’s data 
likely provide the first evidence of sex differences in math test score distributions 
in the U.S.A. and perhaps globally as well. His data are typical of many other more 
recent examples, with small mean differences favoring boys on both tests. Figure 1.1 
reveals, for the first time, data from Stone’s fundamental test, which shows the
smaller mean difference of his two tests. Stone was surely unaware of this mean
difference, however, and of Fig. 1.1, which is a kernel density estimation plot, a graph
simply unimaginable for Stone. His tools were paper and pencil. Consequently, Stone
computed no quantities and no graph from these data. He likely could only stare 
at the tabled data on 250 boys and 250 girls he provided for each of two tests. He 
mused at the apparent variabilities he saw. “Doubtless the most noteworthy feature 
of these tables is the wide variability of achievements [3, p. 32].” 
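
For readers unfamiliar with the technique, a kernel density estimate of the kind plotted in Fig. 1.1 can be sketched in a few lines. This is a minimal illustration only: the Gaussian kernel, the bandwidth, and the five scores below are assumptions made for the example, not Stone's data and not the settings used to draw the figure.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centered on each observed score, then scaled.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical scores, for illustration only (not Stone's data).
scores = [2.0, 2.5, 3.0, 3.5, 4.0]
f = gaussian_kde(scores, bandwidth=0.5)

# A density should integrate to about one; check with a Riemann sum on [-3, 9].
step = 0.01
area = sum(f(-3.0 + i * step) for i in range(1201)) * step
print(round(area, 3))
```

Plotting `f` over a grid of score values, once for boys and once for girls, would give two smooth curves comparable in kind to those in the figure.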

Stone's data, and Fig. 1.1, which shows that the boys’ distribution is slightly
right tilted relative to the girls' distribution, make clear that today's emphasis on
the “gender gap,” the small mean difference favoring boys in math, is nothing 
new. It was evident at the dawn of achievement testing, well more than 100 years 
ago. This fact also suggests the gap is not going away anytime soon. Early 
investigators may have had an inkling of such distributional differences but were 
unable to explicitly quantify their beliefs, which was certainly true for Stone. That 
distributional differences were evident at the outset of testing clearly weakens any 
contemporary claims that sex differences in math are the consequence of policy 
decisions or evolving social or environmental forces. 
Fig. 1.1 Kernel density estimation plot of the sixth grade test scores of 250 boys and 250 girls on
Stone's (1908) fundamental math test. The boys’ mean is 3.193, and the girls’ mean is 3.026, with 
corresponding standard deviations of 1.048 and 0.996 


Commentary and other table entries Stone provides indicate he was well aware of 
sex differences in test score variability, albeit assessed with methods now obsolete 
and evident in small samples of his data. Also evident in Stone's early data is a
pattern of statistical features evident today in math testing studies large and small, 
in the U.S.A. and worldwide, patterns unrecognized by Stone—with some patterns 
unrecognized by investigators of sex differences in math even today. There will be 
more on this matter momentarily. 

The situation with reading tests is similar. By 1917, Gray [4] recognized there 
were reading test score distributional differences between boys and girls, evident in 
the reading tests he constructed. But like Stone, Gray could not have appreciated 
the manner by which these test score distributions differed, although he did report 
girls' sample mean surpassed boys in oral reading in all grades one to eight of the 
elementary school years. Today, that girls surpass boys in reading mean is a nearly 
invariant universal global test result, at least in developed countries. And like the 
pattern of differences evident in Stone's math testing, a similar unrecognized pattern 
exists in reading testing data as well. 

Let $\bar{x}_b$ and $\bar{x}_g$ denote test score sample means of boys and girls, respectively,
with $s_b^2$ and $s_g^2$ their corresponding sample variances. The way sex differences are
defined today is by effect size d with

$$d = \frac{\bar{x}_b - \bar{x}_g}{\sqrt{(s_b^2 + s_g^2)/2}}.$$

While d can be viewed as an empirical method of gauging sex differences, most
of the writers treat it as an inferential quantity, in which case the appropriateness of
d's use rests on the probability model for d or $\delta$,

$$\delta = \frac{\mu_b - \mu_g}{\sigma_c},$$

and $\mu_b$ and $\mu_g$ are the population means for boys and girls, respectively. $\sigma_c$ is the
assumed common population standard deviation. So d presumably estimates $\delta$.
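
As a small illustration of how d is computed from summary statistics, the sketch below assumes d is the mean difference divided by the square root of the average of the two sample variances. The function name is hypothetical; the inputs are the means and standard deviations reported in the Fig. 1.1 caption for Stone's fundamental test.

```python
import math


def effect_size_d(mean_b, mean_g, sd_b, sd_g):
    """Scaled mean difference d, with the two sample variances averaged."""
    pooled_sd = math.sqrt((sd_b ** 2 + sd_g ** 2) / 2.0)
    return (mean_b - mean_g) / pooled_sd


# Summary statistics from the Fig. 1.1 caption (Stone's fundamental test).
d_stone = effect_size_d(3.193, 3.026, 1.048, 0.996)
print(round(d_stone, 3))
```

Note that d uses the standard deviations only as a scaling factor; whether they differ between boys and girls plays no role in the result.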

Although there has been no conceptual defense arguing that d is the most 
appropriate scalar index for behavioral sex differences data, d has nonetheless been 
for decades the nearly sole unquestioned index by which sex differences on a myriad 
of variables have been assessed in journal articles, meta-analyses, and books [5, 6]. 

However, d has led nowhere. It has provided no assistance in understanding the 
origins of sex differences in reading or math or why they persist. Worse, outcomes 
of d meta-analyses are often puzzling. A common math finding is that the mean d, 
that is, $\bar{d}$, is positive but small and favors boys. For example, in Lakin [7], the overall
math $\bar{d} = 0.077$. Why these small (scaled) mean differences appear has befuddled
investigators [8–10].

Meanwhile, it has long been recognized that typically 


$$s_b > s_g \quad \text{or} \quad s_b/s_g > 1$$


holds for both tasks, reading and math. These inequalities have also long perplexed 
researchers [9, 11–14].

Writers and those with a penchant for d-based meta-analyses, as well as many 
researchers, have slavishly, seemingly lemminglike, taken d as the ultimate sex 
differences scalar index. It has even been claimed d is useful for any sex differences 
variable [15, p. 177]. $\delta$, the model for d, is conceptually inappropriate, as the need
to report $s_b^2/s_g^2$ implicitly makes clear. More importantly, d is a misleading way
of viewing sex differences, at least for math and reading test score data. That is 
because d has directed attention toward mean differences and misdirected attention 
away from the far larger variance differences in data. Once these differences 
are properly recognized, one is forced to recognize the inadequacies of d. More 
centrally however, it forces a need to rethink how differences between the sexes 
should be viewed, again at least for reading and math test score data. 

The key for understanding is not to focus on scaled sample mean differences, as d 
requires. Rather, focus instead on summary data statistical inequalities. In particular, 
focus on pairs of inequalities. These are the data patterns that must be explained by 
any creditable framework which is claimed to account for sex differences in math 
and reading. 

Define the following two sets, the inequalities of which may appear familiar, first 
for math testing and then reading testing, respectively: 


$$mo = \{\bar{x}_b > \bar{x}_g \ \&\ s_b > s_g\} \quad \text{and}$$
$$ro = \{\bar{x}_g > \bar{x}_b \ \&\ s_b > s_g\}.$$


The inequalities of the following two sets have apparently never before been 
recognized. Again, the first set is for math testing, the second set is for reading
testing:

$$mdo = \{s_b^2 - s_g^2 > \bar{x}_b - \bar{x}_g, \text{ given } mo\} \quad \text{and}$$
$$rdo = \{s_b^2 - s_g^2 > \bar{x}_g - \bar{x}_b, \text{ given } ro\}.$$

Call these four sets collectively,

$$S = \{mo, ro, mdo, rdo\}.$$


Read mo as “math order,” ro as “reading order," mdo as “math difference order," 
and rdo as “reading difference order.” The primary task of this monograph can be 
easily stated: model the data structures of S. 
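
The four sets can be checked mechanically for any set of summary statistics. The sketch below is illustrative only: the function names simply mirror the set labels, and the inputs are the means and standard deviations given in the Fig. 1.1 caption for Stone's fundamental test.

```python
def mo(xb, xg, sb, sg):
    """Math order: boys' mean and standard deviation both exceed girls'."""
    return xb > xg and sb > sg


def ro(xb, xg, sb, sg):
    """Reading order: girls' mean larger, boys' standard deviation larger."""
    return xg > xb and sb > sg


def mdo(xb, xg, sb, sg):
    """Math difference order: variance gap exceeds mean gap, given mo."""
    return mo(xb, xg, sb, sg) and (sb ** 2 - sg ** 2) > (xb - xg)


def rdo(xb, xg, sb, sg):
    """Reading difference order: variance gap exceeds mean gap, given ro."""
    return ro(xb, xg, sb, sg) and (sb ** 2 - sg ** 2) > (xg - xb)


# Boys' mean, girls' mean, boys' SD, girls' SD from the Fig. 1.1 caption.
stone_fundamental = (3.193, 3.026, 1.048, 0.996)

for name, check in [("mo", mo), ("ro", ro), ("mdo", mdo), ("rdo", rdo)]:
    print(name, check(*stone_fundamental))
```

For these particular statistics mo holds, while the variance difference (about 0.106) falls short of the mean difference (0.167), so mdo does not.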

One's intuition with numbers can often lead one astray, regardless of how 
technically skilled one might be [16], and there may be repeated instances in the 
following pages when this may be the case. So, consider 


$$\bar{x}_b = 152, \quad \bar{x}_g = 149, \quad s_b = 34 + 3, \quad s_g = 34.$$


Relative sizes are easy numerical comparisons, so it is easily seen in this math 
test example that mo is satisfied. However, mdo may seem not only a strange 
comparison but intuitively more challenging as well. One easily observes the 
difference $s_b - s_g = 3$, which might seem small. But the required difference to
consider is $s_b^2 - s_g^2 = 213$, which perhaps seems larger than intuitively expected
and far larger than $\bar{x}_b - \bar{x}_g$. Small differences in standard deviations can balloon
into large differences in variance. There is not a unique way of thinking about
the matter, but the numerical values of $s_b$ and $s_g$ above were written in the form
w + b and w, respectively, a difference independent of w. The variance difference
$(w + b)^2 - w^2 = b^2 + 2bw$ increases quadratically in b or linearly in w, so the
example provides perhaps an “intuitive correction” going forward. The example is
from Table 2.3, Chap. 2, “The Nation's Report Card” 12th grade 2019 math testing
result. 
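
The arithmetic behind this example can be written out step by step, taking w = 34 and b = 3 as in the text:

```latex
\begin{align*}
s_b^2 - s_g^2 &= (w + b)^2 - w^2 = 2bw + b^2 \\
              &= 2(3)(34) + 3^2 = 204 + 9 = 213, \\
\bar{x}_b - \bar{x}_g &= 152 - 149 = 3.
\end{align*}
```

A difference of 3 in standard deviations thus corresponds here to a variance difference 71 times the size of the mean difference.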

It was claimed decades ago that the sex mean differences on cognitive
tests were disappearing [17], or at least in more recent decades that the mean sex
differences in math have decreased [18], or that the test gaps in math mean are
“...close to zero in developed countries...[19, p. 1219].” Such claims perhaps keep alive
the hope that the small mean sex difference favoring boys in math test scores will
soon vanish with suitable societal changes. This hope is simply wishful thinking.
The empirical fact remains that, from the perspective of more than a century, 
the inequalities of S have widely held, and they continue to do so worldwide, in 
developed countries, with an occasional exception, but only for math. Furthermore, 
the mean gap is not even the largest sex differences gap as S makes clear. Moreover, 
at least in the U.S.A., the small math mean difference disadvantaging girls has 
garnered nearly all the attention. By contrast, girls’ mean advantage in reading is 
multiples larger than boys' mean advantage in math. Yet by comparison, the interest
in sex differences in reading has been little more than a footnote. 

The inequalities of S are so ubiquitous in the literature, as will be demonstrated 
in Chap. 2, they may be said to characterize the structure of the literature for
children's reading and math test score summary statistics. The data for Stone's
fundamental test displayed in Fig. 1.1 satisfy mo. His reasoning test satisfies mdo.
The inequalities of S are the focus of sex differences throughout. Importantly, they 
tightly constrain the class of models that accounts for them in a coherent way. 
Happily, as it turns out, accounting for these inequalities also provides the basis 
for explaining other widely recognized empirical facts long unaccounted for in the 
reading and math observational test score sex differences literature. 

This monograph has four main goals: 


1. To demonstrate the ubiquity of inequalities of S in the broad observational child 
math and reading achievement test score literature. 

2. To provide a coherent model that accounts for the inequalities of S. 

3. To provide coherent accounts for other well-recognized yet unexplained empiri- 
cal differences in boys’ and girls’ reading and math test score distributions. As an 
example, an issue of widespread puzzlement in math testing concerns the right
tails of boys’ and girls’ test score distributions. The right tail for boys typically 
has a larger empirical data mass than the corresponding right tail for girls. 
Stone’s data in Fig. 1.1 provide suggestive evidence for this widely recognized 
empirical fact. Less well recognized is that focusing on the left tails of the reading 
test score distributions reveals boys have a larger empirical data mass than do 
girls. So, boys share an unexplained “infamy” for their anchoring of the tails 
of distributions associated with tasks arguably representing the most important 
skills children will ever need. 

Indeed, boys are far more overrepresented in the bottom of the reading 
test distribution than they are overrepresented in the top part of the math test 
distribution, as will be seen later. The goal is to address these puzzles in a 
coherent way within a suitable model. 

4. The final goal is to provide model parameter estimates, thus illuminating, with 
numerous associated graphics, the model’s plausibility. 


While the focus here is narrow and confined, namely the reading and math obser- 
vational test score literature, that literature is substantial. It includes reported test 
standardization data, large sample surveys, and observational studies of children’s 
test scores obtained in classroom settings. That girls achieve higher grades than boys
in the classroom, a success that does not seem to translate to other settings, is a
well-recognized and important fact [20]. However, it is an example of a matter that is
outside the neighborhood of focus here. 

Efforts to address sex differences, especially in math, nearly always cast a far 
larger net. For Ceci and Williams, 2010 [5], test score differences are only a small 
part of their far larger agenda. Their primary focus is why there are relatively few 
women in engineering, computer science, and other math-oriented occupations. 
These issues are certainly compelling but are not addressed here. However, some 
findings from the present analysis seem to have relevance for understanding sex
differences observed in larger social contexts. When this is recognized, the possible 
implications will be noted. 

The path taken here is different and likely to be unfamiliar. The literature 
on sex differences is dominated by conventional statistical approaches featuring 
regressions, correlational analyses, path diagrams, and effect sizes. No conventional 
statistical approaches are featured here. This announcement should not surprise. 
That is because no conventional approach has been able to coherently explain 
the inequalities of concern nor the many perplexing empirical facts of data. 
The evidence is clear: after decades of attention, there has been no compelling 
explanation, using a myriad of conventional statistical approaches, why the most 
fixated feature of attention, which has essentially defined interest in sex differences 
in math, the mean gender gap, exists or why it persists. Furthermore, the task is 
more complex than explaining the mean difference, the focus of nearly all research. 
The much larger variance differences must be addressed as well. And this goal must 
be achieved not just for math sex differences but also for reading sex differences. 

Consequently, a new framework must be provided. It is developed below. That 
new framework is a probability model called model Y. It is developed in Chap. 4 
and is the framework within which all sex differences in test score distributions 
are viewed. The model forms the only basis for interpretation of sex differences 
throughout. In addition, of course, the model requires a new parameter estimation or
statistical procedure which is provided. One feature of the approach is that there 
is no appeal to covariates as is commonly the case. Covariates may be important, 
but they play no role in the analyses that follow. 

In addition, the path forward is likely to be, for some readers at least, unpopular. 
It would certainly be preferable if this were not the case. However, to explain test 
score sex differences coherently, there appears no choice: a genetical theory of
X-linkage appears to be the only viable perspective available, and when suitably
modeled, it is remarkably parsimonious. It requires just three parameters, the same
number as $\delta$, so it might be hoped readers can warm to the idea. Such an idea
was decades ago mocked or ignored, an attitude that persists today, at least in 
some quarters. However, a model so based accounts remarkably well for the sex 
differences observed, as well as the inequalities. To anticipate matters, the empirical 
inequalities of S are just expectations under the model proposed. 

Genetical models have never been widely popular, at least in most psychological 
circles, and their role in explaining behaviors is often grudgingly acknowledged if 
acknowledged at all. However, it is important to recognize that genetical models 
have never been proposed to explain behaviors when other coherent explanations 
were available. As Ceci and Williams (2010) note, advances in understanding
“...comes from free and open debate in which all sides present their best evidence
and no one is excoriated for arguing the unpopular side [5, p. 219].”

This plea for openness was expressed more than a decade ago. Is the current 
psychological science environment open to unpopular perspectives? Staddon [21] 
certainly does not think so. He castigates individuals, universities, funding insti- 
tutions, and more for their lack of openness in science to ideas and intellectual 
diversity. Staddon compares the current situation, especially with respect to
understanding group cognitive differences, with Lysenkoism [21, p. 167]. In fact, the spirit
of Lysenkoism is now seen as threatening all of science [22]. 

Can one, nonetheless, expect a spirit of openness to ideas and conceptualizations 
regarding unexplained sex differences in reading and math test scores? One certainly 
hopes so especially for what follows here. But the overall outlook appears cloudy 
at best if the path forward is to report only the most austere empirical facts. An 
example is a recent article which reports sex differences associated with the most 
innocuous, banal, everyday adult behaviors, such as cooking, sewing on a button, or 
washing cars [23]. Yet it is written in a remarkably cautious style. Just how to refer 
to sex differences requires, for the authors, a definition: “... we label differences
by the hybrid neologisms of gender/sex... and sex/gender... and apply these terms
interchangeably [23, p. 1340].” They strive to avoid any scintilla that would suggest
mechanisms producing such differences by stressing “...we assiduously avoid
discussing particular causal explanations of gender/sex differences and similarities
or indicating our personal preferences for any theories of causation [23, p. 1340].”
And some readers may view their discussion as seeming to apologize for even
pointing out that there are sex differences, perhaps fearing they might antagonize
some readers (in particular, journal editors). They write near their close “...our
insights that sex/gender differences can be simultaneously large and small might
appear to some readers to threaten gender equality [23, p. 1354].” Should this last
quote be understood to mean simply recognizing that to “cook meat on the grill” is a
more masculine than feminine task is likely to pose a threat to gender equality? One 
can guess at the ramifications for the study of sex differences should this writing 
style become the template for the future. It would surely curtail a core goal of the 
scientific process, causal explanations, or as Judea Pearl would put it: Why? [24]. 

Doubtlessly, anyone who has studied the sex differences literature has observed 
the inequality pairs in mo and ro. They are simply too dominant in the literature to 
avoid notice. The surprise is that the mdo and rdo inequalities widely hold. In fact,
the size of the sample variance differences usually far exceeds the sample mean 
differences. It is difficult to imagine any other fact of empirical data that would 
signal a more robust rejection of conventional perspectives on how sex differences 
are viewed, namely that mean differences capture the essence of distributional 
differences, at least for math and reading sex differences test scores. These empirical 
facts, to be surveyed in Chap. 2, would seem to provide overwhelming conceptual 
problems for those who claim the sex differences in math are vanishing, ignorable, 
or may not exist at all [25]. 

Small sex differences, in sample means, especially in math, have enjoyed the 
attention while variance differences have largely been ignored, or at least from a 
substantive perspective, discounted. The reasons for this emphasis on means and not 
variances are not difficult to discern and involve both technical and psychological 
considerations. One psychological reason is that it is easy to think in terms of 
group differences as simply realizations of visual images likely to be familiar: 
two identically shaped typically normal-like or “bell-shaped” distributions shifted 
slightly apart, consequently differing only in mean. This is an example of what are 
termed location-shift models; the intuition is compelling. $\delta$ is a scaled version of
a location-shift model. Where variances and their differences are concerned, these 
intuitive advantages seem to disappear. 

The dominance of effect size d as a sex difference defining assessment tool has 
certainly reinforced this mean difference perspective. Of course, the widely used 
expression "gender gap" reinforces this cognitive perspective. Sometimes variance 
parameters are called nuisance parameters [26, p. 3] signaling the perceived 
unimportance of variances. The most commonly used statistical models often 
assume equal population variances, with a corresponding central attention on 
population means. Analytically, unequal population variances have been historically
troublesome for statistical theory and consequently applied researchers as well. 
Some of these issues seem practically unimportant now with computing replacing 
analysis, as with the bootstrap [27]. 

Still, thinking intuitively about the mechanisms that might produce variance 
differences in behavior seems cognitively difficult. Yet sample variance differences 
are by far the dominant feature of reading and math test score sex difference
summary statistics. This is the reality, and this reality has been unchanged for 
decades. To repeat, any sex differences theory must account for variance differences 
and the mean differences as well. This fact was recognized by Feingold in 1992 
[28] who argued, correctly, that to understand cognitive sex differences both means 
and variances must jointly be considered. But efforts to do so have not subsequently 
appeared. 

Now, fully 30 years later appears an article with title “Joint Consideration of
Means and Variances Might Change the Understanding of Etiology” [29]. The
article is a welcome addition. But it focuses largely on adoption or twin studies 
within the context of conventional behavioral genetics approaches and does not 
consider its larger conceptual importance. No new framework is proposed. The 
place to start is at the model level and perhaps with discrete probability models. 
That is because in all popular ones, the binomial, negative binomial, Poisson, 
and geometric, the means and variances are linked through their shared common 
parameters. Instead, the authors focus at the data analysis level. 
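
The linkage between means and variances in these discrete models can be made explicit. The table below lists standard results under common parameterizations, which are assumptions for this illustration: the geometric variable counts trials up to and including the first success, and the negative binomial variable counts failures before the r-th success.

```latex
\[
\begin{array}{llll}
\text{Model} & \text{Parameters} & \text{Mean} & \text{Variance} \\
\hline
\text{Binomial} & n,\ p & np & np(1-p) \\
\text{Poisson} & \lambda & \lambda & \lambda \\
\text{Geometric} & p & 1/p & (1-p)/p^2 \\
\text{Negative binomial} & r,\ p & r(1-p)/p & r(1-p)/p^2
\end{array}
\]
```

In each case a change in a parameter that moves the mean also moves the variance, so the two cannot be treated as unrelated quantities.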

The literature review will consider only readily accessible reading and math 
observational test score literature and focuses on whether the inequalities of S are 
satisfied or not. Thus, mostly reported are studies which report the key quantities 
of focus, the sample means, and sample variances (or standard deviations) for both 
boys and girls. These statistics will be denoted by V defined by 


$$V = \{\bar{x}_b, \bar{x}_g, s_b, s_g\}.$$

If sample sizes are also available, V will appear with six elements

$$V = \{\bar{x}_b, \bar{x}_g, s_b, s_g, n_b, n_g\},$$

where $n_b$ and $n_g$ are boys’ and girls’ sample sizes, respectively.
In all that follows, V are the only data of focus. No new data are reported. Unless
V is explicitly reported, mdo and rdo cannot be evaluated. If mdo holds or rdo 
holds, then mo or ro must hold as well. However, mo or ro can hold while the more 
constraining inequalities of differences, mdo or rdo, can fail. They can fail simply 
because of the noise of real data, and probabilistically, they are more likely to do so. 
Sometimes reported are d and $s_b^2/s_g^2$, which enables mo or ro to be evaluated, but
without being able to explicitly define the elements of V, in which case mdo or rdo 
cannot be evaluated. 
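
One way to see why d and the variance ratio suffice for mo or ro but not for mdo or rdo is sketched below. The function names are hypothetical, and the rescaling by 100 is an invented change of score scale used purely for illustration. The first V is the numerical example given earlier (152, 149, 37, 34). Both versions yield the same d and the same variance ratio, yet mdo holds for one and fails for the other, so the two reported quantities alone cannot decide it.

```python
import math


def order_from_summary(d, variance_ratio):
    """Report which order pair holds, given only d and s_b^2 / s_g^2."""
    if variance_ratio <= 1.0 or d == 0:
        return None
    return "mo" if d > 0 else "ro"


def summary(xb, xg, sb, sg):
    """Return (d, variance ratio) for one V."""
    d = (xb - xg) / math.sqrt((sb ** 2 + sg ** 2) / 2.0)
    return d, sb ** 2 / sg ** 2


def mdo_holds(xb, xg, sb, sg):
    """mdo needs the unscaled means and variances themselves."""
    return xb > xg and sb > sg and (sb ** 2 - sg ** 2) > (xb - xg)


original = (152.0, 149.0, 37.0, 34.0)
rescaled = tuple(v / 100.0 for v in original)  # hypothetical score scale

for v in (original, rescaled):
    d, ratio = summary(*v)
    print(order_from_summary(d, ratio), round(d, 4), round(ratio, 4), mdo_holds(*v))
```

The variance difference scales with the square of the score unit while the mean difference scales linearly, which is why the elements of V are needed explicitly.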

Thus, the thrust of the literature review is to illustrate the dominance of the 
inequalities in several domains. This perspective departs sharply from the current 
practice which defaults to d as the sex difference index, which is often coupled with 
the belief that small d, however consistently signed, or that ratios of $s_b^2/s_g^2$, however
consistently greater than one, are apparently ignorable. 

In fact, the consistency of mo and ro in the observational math and reading 
test score literature must stand as one of the most remarkably consistent and easily 
observable empirical findings in the child literature. And the evidence has been but 
a mouse click away. To anticipate, simply glance at the “Nation’s Report Card” data 
[30] reproduced in Table 2.1 through Table 2.4 in Chap. 2. That mo or ro widely hold 
is immediately evident. Less obvious is that the inequalities of rdo and mdo nearly 
always hold as well. These results are based on millions of children assessed over 
decades using evolving but remarkably sophisticated and representative research 
protocols. 

There has long been a dismissive attitude toward small differences in sample 
values, especially sample means. For example, Miller and Halpern write “Sex 
differences in average mathematics test performance also decreased during the 
1970s to 1980s and have since remained small to negligible [18, p. 38, italics 
added].” In fact, they go further explicitly defining cognitive sex differences as mean 
differences, so variance differences are apparently irrelevant. One would expect 
$s_b^2/s_g^2$ to fluctuate about one, if $\delta$, the model for d, were appropriate. Additionally,
one would expect d to fluctuate about zero if boys and girls followed the same test 
score distribution. In fact, neither is the case. Miller and Halpern [18] reference 
Lakin [7] to support their claim. Lakin reported data implying 28 V. Of these, 26 
satisfied mo. Thus, all but two implied V satisfied both $\bar{x}_b > \bar{x}_g$ and $s_b > s_g$.
Furthermore, Lakin employed CogAT test standardization data with huge sample 
sizes and data spanning decades. 

A dramatic example of $s_b^2/s_g^2 > 1$ consistency was dispatched as unimportant. It
appears in Science, where a green colored headline proclaims: “Standardized tests 
in the U.S. indicate girls now score just as well as boys in math [9, p. 494].” The 
data source is state math test data from 10 U.S. states. In Table S1 [31] are reported 
d, variance ratios, and sample sizes, data equivalent to 66 V. The sample sizes are 
typically in the tens to hundreds of thousands. Of the 66 variance ratios, 65 satisfy 
$s_b^2/s_g^2 > 1$. This fact was noted but dismissed with the observation: “However, none
are very large and... the male variance is not markedly greater than female variance
[31, p. 2].” Such remarkable consistencies need an explanation, not dismissal.
Much more variable were the 66 d, which range from $-0.13$ to 0.10, with $\bar{d} =
-0.007$, a rare negative value computed from data in their Table S1. However, how
the tests vary from state to state and their corresponding content appears unknown. 
It is only on math tests with a reasoning component that boys display a typically 
small mean advantage. This will be clear in Chap. 2. These data also speak to the
consistency with which the variance differences, $s_b^2 - s_g^2$, are often more consistently
positive (or equivalently, $s_b^2/s_g^2 > 1$), while the mean differences $\bar{x}_b - \bar{x}_g$ often seem
more apt to fluctuate in sign, but only for math: for reading, $\bar{x}_b - \bar{x}_g > 0$ essentially
never appears.
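
A rough sense of how unusual 65 of 66 variance ratios above one would be, if the ratios merely fluctuated about one, can be had from a simple binomial calculation. The sketch assumes, purely for illustration, that each ratio is equally likely to fall above or below one and that the 66 ratios are independent; results from grades within the same state are unlikely to be fully independent, so the number should be read only as an order-of-magnitude guide.

```python
from fractions import Fraction
from math import comb


def upper_tail(n, k):
    """P(X >= k) for X ~ Binomial(n, 1/2), computed exactly."""
    favorable = sum(comb(n, j) for j in range(k, n + 1))
    return Fraction(favorable, 2 ** n)


# 65 or more of 66 ratios exceeding one, under the assumed fair-coin model.
p = upper_tail(66, 65)
print(float(p))
```

Under these assumptions the probability is 67 divided by 2 to the 66th power, far too small to attribute the pattern to chance fluctuation about one.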

The perspective here is that it is the consistency of the outcomes over realizations 
of V that is far more important than the sizes of the mean differences within 
V which, only for math, and not reading, are often small. This attitude contrasts 
sharply with some other perspectives and attitudes toward regularities in empirical 
data. An example concerns d or $\bar{d}$, which, as noted earlier, has persistently been
observed in math meta-analyses to be positive but small, favoring boys. This 
empirical fact apparently has been an irritation. A “solution” has been proposed: 
simply define the problem away! Hyde and colleagues have long defined d < 0.10 
as trivial [9, 15, 25, 32] (The definition probably intended is |d| < 0.10.) or 
alternatively as “...an effect so small as to be considered no gender difference [33, 
p. 8801].” A conceptual explanation of why small d > 0 persistently appears in
math studies and corresponding $\bar{d} > 0$ in meta-analyses will be provided later in
Chap. 6. 

Kagan in 2012 viewed psychological research as in crisis and suggested several 
reforms researchers should take to fix the matter. Kagan writes: “The most important 
reform urges a search for patterns in the body of evidence. ..[34, p. 249].” The 
thrust here is very much in this spirit. A goal of the literature review is to establish
the consistency of the patterns of inequalities of S over several domains, including 
children's ages, countries, tests, and time frames. 

Following the literature review in Chap. 2 and a chapter on alternative
perspectives in Chap. 3, the theory will be developed in Chap. 4. As noted, it will
be revealed that the inequalities of S are simply expectations under the theory 
proposed. Consequently, the data inequalities are nothing surprising. Subsequently, 
in Chap. 5, many math and reading V examples will be analyzed, their parameter 
estimates given, with many solutions graphically portrayed. Chapter 6 summarizes 
matters, extends the analysis in some ways, and applies the model framework to 
selected settings. 

Transparency and Openness (TOP): all analytical results, estimation procedures, 
and computer code are contained herein or are in Appendix A. All V which have
been identified, which are publicly available and readily accessible, form the basis 
for the review in Chap. 2. It would appear that the research reported here satisfies 
TOP level three criteria for all eight standards, where applicable [35]. 
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter's Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter's Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Chapter 2
Literature Review with Focus on Inequalities


Evidence of the inequalities of S comes from several sources. They are widely
evident in large scale U.S. national and international PISA studies, which provide
compelling evidence. Numerous other studies, including the earliest U.S. achievement
tests, also provide similar evidence.


2.1 U.S. National and International Studies


2.1.1 NAEP: “The Nation's Report Card”


U.S. reading and math data are available from the largest, most representative, and
congressionally mandated U.S. assessment, the National Assessment of Educational
Progress (NAEP), known as the “Nation's Report Card” [30]. Since NAEP's 1969
start, when John Tukey served as technical advisor in its early days, it has evolved both
technically and legislatively. Millions of children have been assessed over the years,
with 10,000 to 20,000 children involved in a nationwide grade-level assessment
[36]. There can be little doubt that NAEP tests provide by far the best estimates of
children's math and reading achievement for the U.S.A. and, correspondingly, the best
information on sex differences as well. The NAEP tests have been called the “gold
standard” for monitoring children's academic progress [37, p. vii]. However, the
estimates provided are not conventional simple random sample estimates, nor do
they possess random sample estimate properties.

The NAEP sampling procedure is multistage, clustered, and stratified. First, the 
U.S.A. is partitioned into primary sampling units (PSUs) which involve one or more 
counties, and then there is sampling of schools within the PSU and then students 
within schools. About 30 children are tested within sampled schools [38] with each 
tested child receiving a random sample of test items, in one of many different test 
booklets [26, Chapter 2]. Thus, no child receives more than a very small sample of
items, and estimates of individual children are not possible, but groups of children 
can be compared. Coupled with the NAEP complex sampling procedure is the 
estimation procedure involving Item Response Theory, latent variables, multiple 
imputation, numerical integration, student weights, and more [26, 37, 39]. A further 
consequence is that sample sizes are not meaningfully associated with estimated 
quantities. 

The most elementary generally recognized necessary requirement of any statistical
estimate is that it is consistent. That is, one “does better” as sample size
increases, so the sample estimates approach their parameter values as sample sizes 
increase. The consistency of the estimates in NAEP-like settings is not assured [26, 
p. 29]. This fact is made intuitively plausible at the individual subject level because 
each child is given only a few of the possible test items available. Consequently, 
the consistency of an estimate of a child's ability may not be achieved, a fact that 
can have implications in subsequent procedures. For the NAEP estimated means 
and standard deviations, standard errors are not reported. Their calculation, usually 
involving jackknifing, is not straightforward given the complexity of the sampling 
and estimate procedures [39]. 
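
To make the idea of a jackknife standard error concrete, here is a minimal sketch for the simplest possible case, a simple random sample, which is precisely the case NAEP is not. The function name `jackknife_se` and the toy scores are illustrative assumptions; the operational NAEP calculations involve replicate weights and imputed values and are considerably more involved [39].

```python
import math
import statistics


def jackknife_se(data, estimator):
    """Delete-one jackknife standard error of `estimator` applied to `data`."""
    n = len(data)
    # Recompute the estimate n times, each time leaving one observation out.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)
    return math.sqrt(variance)


# Hypothetical scale scores, for illustration only (not NAEP data).
scores = [282, 295, 301, 268, 310, 277, 289, 305, 262, 298]

se_mean = jackknife_se(scores, statistics.mean)
se_sd = jackknife_se(scores, statistics.stdev)
print(f"jackknife SE of the mean: {se_mean:.3f}")
print(f"jackknife SE of the SD:   {se_sd:.3f}")
```

For the sample mean, the delete-one jackknife reproduces the textbook standard error, the sample standard deviation divided by the square root of n. Its practical value lies with statistics such as the standard deviation, for which no equally simple formula is at hand.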

These details concerning the NAEP sampling and estimation procedures make it 
abundantly clear that conventional textbook statistical tools useful for random sam- 
ples are simply inappropriate. However, some other investigators have, nevertheless, 
forged ahead with standard statistical procedures [40]. 

These same general procedures employed in the NAEP surveys are employed in 
the international surveys, such as the PISA tests, noted below, which were modeled 
after the NAEP procedures. And these facts have further implications for estimates 
based on V estimates from large sample survey procedures. It means, as is noted 
subsequently, that there is no guidance for the construction of standard errors of 
parameter estimates based on V from large sample surveys. 

Tables 2.1 and 2.2 display NAEP math and reading means and standard deviations
at Grades 4 and 8. Tables 2.3 and 2.4 display math and reading at Grade 12
[30]. The columns of these tables are labeled $\bar{x}_b$, $\bar{x}_g$, $s_b$, and $s_g$. This notation is for
consistency only; they are not conventional sample means and standard deviations
for reasons just addressed. In these tables, those V which fail to satisfy either mo
or ro appear in bold font. Otherwise, all V satisfy the stronger order relation, mdo
or rdo. In Table 2.1, of 26 V for math, 23 satisfy mdo, while 23 of 27 reading V 
satisfy rdo in Table 2.2. Similarly, Tables 2.3 and 2.4 display twelfth grade math 
and reading data. For twelfth grade reading, seven of eight V satisfy rdo. All five 
math V satisfy mdo. Collectively, of the 66 NAEP V in these tables, 58, or all but 8,
satisfy mdo or rdo. Seven of the eight failures are ties, likely the result of rounding.

The mean differences in math are always small. The boys' mean never exceeds 
the mean of the girls by more than three points, but it is the consistency, spanning 
decades, which is striking. While the mean differences in math are always small, the 
mean differences in reading favoring girls are substantially larger, and girls always 
have larger means. The girls' mean reading advantage ranges from 6 to 15 points, 
and girls never show less than a 9-point mean advantage at the eighth and twelfth 
Table 2.1 NAEP Grades 4 and 8 Math

Table 2.2 NAEP Grades 4 and 8 Reading

Table 2.3 NAEP Grade 12 Math

Note: All V satisfy mdo
Table 2.4 NAEP Grade 12 Reading

Year | $\bar{x}_b$ | $\bar{x}_g$ | $s_b$ | $s_g$
2015 | 282 | 292 | 41 | 39
2013 | 284 | 293 | 39 | 36
2009 | 282 | 294 | 39 | 36
2005 | 279 | 292 | 39 | 37
2002 | 279 | 295 | 37 | 36
1998 | 285 | 298 | 38 | 35
1994 | 280 | 294 | 36 | 36
1992 | 282 | 297 | 33 | 32

Note: Bold font denotes ro failure


grade levels. Of those V satisfying mdo, the median ratio of $(s_b^2 - s_g^2)/(\bar{x}_b - \bar{x}_g)$
is 72, so the variance differences dwarf the mean differences. The median of the
corresponding rdo reading ratio $(s_b^2 - s_g^2)/(\bar{x}_g - \bar{x}_b)$ is 13. In summary, mdo and
rdo widely hold in the NAEP reading and math data. The story is little changed in
other settings.
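
The tallies above can be checked mechanically. The following is a minimal sketch applied to the Grade 12 reading rows of Table 2.4. It assumes, as the usage in this chapter suggests, that ro requires the girls' mean to exceed the boys' mean together with $s_b > s_g$, and that rdo requires $s_b^2 - s_g^2 > \bar{x}_g - \bar{x}_b > 0$; the formal definitions are given earlier in the monograph. The function and variable names are illustrative only.

```python
from statistics import median

# Table 2.4, NAEP Grade 12 reading:
# year -> (mean_boys, mean_girls, sd_boys, sd_girls)
TABLE_2_4 = {
    2015: (282, 292, 41, 39),
    2013: (284, 293, 39, 36),
    2009: (282, 294, 39, 36),
    2005: (279, 292, 39, 37),
    2002: (279, 295, 37, 36),
    1998: (285, 298, 38, 35),
    1994: (280, 294, 36, 36),
    1992: (282, 297, 33, 32),
}


def satisfies_ro(xb, xg, sb, sg):
    """Assumed reading order: girls' mean larger, boys' SD larger."""
    return xg > xb and sb > sg


def satisfies_rdo(xb, xg, sb, sg):
    """Assumed reading dominance order: variance gap exceeds a positive mean gap."""
    return sb**2 - sg**2 > xg - xb > 0


ro_failures = [year for year, v in TABLE_2_4.items() if not satisfies_ro(*v)]
rdo_years = [year for year, v in TABLE_2_4.items() if satisfies_rdo(*v)]
ratios = [
    (sb**2 - sg**2) / (xg - xb)
    for xb, xg, sb, sg in (TABLE_2_4[year] for year in rdo_years)
]

print("ro failures:", ro_failures)
print("rdo holds in", len(rdo_years), "of", len(TABLE_2_4), "years")
print("median variance-to-mean ratio (this table only):", median(ratios))
```

On these eight rows, the only failure flagged is 1994, where the two standard deviations tie, in agreement with the seven of eight reported above. The median ratio printed refers to this one table and so need not equal the value of 13 obtained over all rdo reading V.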


2.1.2 PISA Tests 


The 2015 Programme for International Student Assessment (PISA) tests [41, 42],
which involve 5000 students from each country, reveal that among 15- to 16-year-olds,
girls' reading average exceeded boys' average in all 44 countries, by an average of
27 points. Among 42 countries with math data, boys' average exceeded the girls'
average in 35 countries, by an average of 8 points. Variances are not easily accessible for the 2015
data.

Tables 2.5 and 2.6 display the PISA means and standard deviations for math and 
reading for 41 countries in the 2003 testing cycle [43, 44]. While these published 
reports do not contain means and standard deviations for each country, they were 
graciously provided by Tuomas Pekkarinen, of the Helsinki School of Economics. 
Again, these means and standard deviations are not conventional sample estimates, 
although the column labels use familiar notation. In Table 2.5 mdo holds for 36 of
41 countries; for the five countries shown in bold font, mo fails. In Table 2.6, all 41
reading V satisfy rdo.

As with the NAEP data, the amounts by which the boys' math mean exceeds the girls' math
mean are far smaller than the corresponding reading differences favoring girls.
The 39 positive $\bar{x}_b - \bar{x}_g$ math differences ranged from 1.21 to 28.84 with mean
difference 11.34. For reading, all $\bar{x}_g - \bar{x}_b$ were positive, with differences ranging
from 13.27 to 57.76 and mean difference 33.63. Clearly, girls dominate boys in
every OECD country in reading at least in 2015 and 2003.
Table 2.5 PISA Math

Country $\bar{x}_b$ | $\bar{x}_g$ | $s_b$

Australia 526.89 | 521.55 | 99.23 
Austria 509.39 | 501.82 | 97.12 
Belgium 532.88 |525.37 |114.35 
Brazil 364.70 | 348.44 | 104.18 
Canada 540.77 | 529.60 | 92.16 
Czech Republic 523.84 | 508.87 | 97.32 
Denmark 522.73 | 506.15 | 90.77 
Finland 548.00 | 540.60 | 87.63 
France 515.28 | 506.76 | 95.68 
Germany 507.87 | 498.90 | 105.07 
Greece 454.95 | 435.55 | 98.13 
Hong Kong 552.40 | 548.35 | 107.44 
Hungary 493.70 | 485.90 | 95.46 
Iceland 507.65 | 523.06 | 94.55 
Indonesia 361.84 |358.50 | 79.35 
Ireland 510.18 | 495.38 | 86.32 
Italy 474.92 |457.09 | 100.97 
Japan 538.53 | 530.11 | 106.74 
Korea 551.71 | 528.31 | 93.35 
Latvia 484.84 | 482.03 | 91.77 
Liechtenstein 549.84 | 521.00 | 105.68
Luxembourg 501.93 | 484.76 | 94.77 
Macau-China 538.19 | 516.94 | 91.32 
Mexico 390.87 | 379.97 | 87.04 
Netherlands 540.33 | 535.22 | 92.39 
New Zealand 530.71 | 516.23 | 101.51 
Norway 498.27 | 492.05 | 95.96 
Poland 493.04 | 487.45 | 95.50 
Portugal 472.44 | 460.19 | 93.23 
Russian Federation | 473.50 | 463.38 | 96.30 
Slovakia 507.29 | 488.63 | 94.94 
Serbia 437.48 | 436.27 | 90.14 
Spain 489.61 | 480.74 | 92.28 
Sweden 512.31 | 505.78 | 96.86 
Switzerland 534.58 | 517.95 | 100.40 
Thailand 414.77 | 418.79 | 84.16 
Tunisia 364.91 | 352.74 | 82.31
Turkey 430.23 | 415.09 | 109.00 
United Kingdom 511.80 | 505.14 | 93.66
United States 485.96 | 479.71 | 99.15 
Uruguay 428.39 | 416.30 | 102.01 


Note: mo fails for countries in bold font. 36 countries satisfy mdo
Table 2.6 PISA Reading

Country $\bar{x}_b$ | $\bar{x}_g$ | $s_b$ | $s_g$
Australia 506.09 | 545.43 | 100.47 | 89.81 
Austria 467.13 | 514.35 | 105.46 | 94.91 
Belgium 489.33 | 526.23 | 113.55 | 102.58 
Brazil 384.22 | 418.85 | 115.85 | 104.57 
Canada 514.00 | 545.53 | 92.77 | 82.58 
Czech Republic 473.10 | 504.40 | 95.36 | 93.05 
Denmark 479.39 | 504.80 | 89.77 | 85.03 
Finland 521.39 | 565.41 | 82.43 | 73.34 
France 476.10 | 514.29 | 99.99 | 90.50 
Germany 470.80 | 512.93 | 111.45 | 102.20 
Greece 452.88 | 490.37 | 110.19 | 95.54 
Hong Kong 493.83 | 525.36 | 90.67 | 75.26 
Hungary 467.24 |498.20 | 92.93 | 88.12 
Iceland 463.81 |521.57 | 99.92 | 87.25 
Indonesia 369.48 |393.52 | 75.43 | 75.27 
Ireland 501.08 | 530.10 | 87.08 | 83.51 
Italy 455.24 |494.59 |105.42 | 92.23 
Japan 486.57 |508.98 |110.70 | 99.11 
Korea 525.48 |546.73 | 83.38 | 79.77 
Latvia 470.40 |509.14 | 93.23 | 83.50 
Liechtenstein 516.60 | 534.00 | 93.19 | 85.70
Luxembourg 462.66 | 495.66 | 103.3 | 93.22
Macau-China 490.82 | 504.09 | 69.41 | 63.92 
Mexico 388.59 |410.07 | 96.23 | 92.93 
Netherlands 502.87 | 523.78 | 85.70 | 82.63 
New Zealand 507.73 | 535.35 |107.14 | 100.20 
Norway 475.34 |524.54 |105.07 | 93.48 
Poland 476.78 |516.33 | 99.65 | 87.77 
Portugal 458.52 |494.86 | 97.24 | 84.84 
Russian Federation | 427.84 |456.36 | 97.68 | 86.39 
Serbia 389.93 | 433.05 | 83.29 | 73.70 
Slovakia 453.28 |485.82 | 92.71 | 89.36 
Spain 460.66 |499.78 | 98.65 | 88.08 
Sweden 495.91 | 532.66 | 96.22 | 91.39 
Switzerland 481.99 | 517.49 | 96.22 | 89.76
Thailand 396.45 |439.17 | 78.35 | 72.48 
Tunisia 361.77 |387.10 | 95.22 | 94.63 
Turkey 425.97 | 459.31 | 98.85 | 87.25
United Kingdom 491.82 |520.37 | 95.86 | 90.20 
United States 479.29 |511.30 |103.82 | 95.80 
Uruguay 414.02 |453.32 |125.44 | 114.40 


Note: All 41 countries satisfy rdo 
2.1.3 IEA PIRLS and TIMSS Tests


The International Association for the Evaluation of Educational Achievement (IEA)
sponsors the Progress in International Reading Literacy Study (PIRLS) test, which
assesses reading among fourth graders, and the Trends in International Mathematics
and Science Study (TIMSS) math tests, which assess fourth and eighth graders, as
well as a more advanced TIMSS-advanced math test administered in the last year
of secondary school. Among 79 V, mostly from different countries, in the 2001 and 2006
PIRLS reading assessments [14], 62 satisfied rdo. Never did the boys' mean exceed
the girls' mean in reading. Among 55 countries and 55 V in the 2015 TIMSS fourth
grade math test, 32 satisfied mdo; for the 2015 TIMSS eighth graders,
19 of 46 V satisfied mdo, while among 10 countries in the 2015 TIMSS-advanced
math test, 7 satisfied mdo [45]. The TIMSS fourth and eighth grade tests, but
perhaps not the advanced test, reveal outcome patterns rather different from
those for NAEP or PISA tests, and the reason seems to be the content of the tests.
It is generally understood that the TIMSS items assess more basic skills than do the
PISA and NAEP math tests [46] and thus are correspondingly less comparable to
them. As will be noted shortly, it has been known for a century from large sample
New York City school data that girls exceed boys on some arithmetic tests in grade
school and high school.


2.2 Publicly Accessible Individual Studies Reporting V 


Six thousand entering students in 47 California junior colleges received the math
portion of the Iowa High School test during the 1929-1930 class year; mdo holds
[47]. Of 21 V, 19 satisfy ro, and among these 19, 17 also satisfy rdo for seven
grades and three reading tests [48]. Among a combined third grade sample, rdo
holds [49]. For 10-year-old children from eight schools, rdo holds [50]. Among
four V for math-precocious kindergartners and preschoolers, mo always holds and
mdo holds for three V [51]. Among eight V for reading and four V for math, mdo
and rdo always hold [52, 53]. Among five V for elementary Grades 2 through 6, mdo
always holds [54]. Among SAT assessments, mdo holds for all eight V [12]. In
Johns Hopkins University talent searches employing the SAT math scores, mdo
holds for all 10 V [55, 56]. Among twins assessed at ages 7, 9, and 10 and for
three different math skills and one reading test at each age, mo and ro hold for all
12 V [57]. Lakin was noted above [7]. She reported standardization data implying
28 V, with results spanning 27 years, for ages 9 to 17; 26 of 28 V satisfy mo, and
mdo could not be evaluated. Among four V summarizing Taiwanese children's math
test performance, four satisfied mo, and three satisfied mdo [58]. In an Australian
sample of children in their fifth year of schooling, mdo holds for math, while for reading
$s_b > s_g$ fails [59]. Cascella [60] reports two large sample V based on fifth and tenth
grade Italian children; both satisfy mdo. A massive 2013 Italian math study involved
“...the entire population of Italian children in school years 2, 5, 6, 8 and 10” [61,
p. 4]. Reporting implied all years satisfy mo except year 5, where $s_b = s_g$; mdo
could not be evaluated. Of two V for sixth grade Kenyan children, one for private
schools and the other for public schools, both satisfy mdo [62]. Maccoby and Jacklin
[53, p. 90] report two V for Project Talent where mdo holds for both and one ACT
test V where mo holds but mdo fails. They report two other studies for which mo
fails.

In the preceding paragraph, there is reference to 88 math V. All but nine appear
independent of one another, and all but five V satisfy mo or mdo. For reading,
reference is made to 35 V. All but three V appear independent of one another, and
all but three satisfy ro or rdo.


2.3 Early Twentieth Century U.S. Reading and Math Tests 


The earliest empirical findings on achievement test score sex differences provide a 
useful backdrop for viewing contemporary hypotheses regarding sex differences in 
math and reading achievement testing. There are, however, difficulties in assessing 
this early literature. While V are desired, they are infrequently available with the 
elements of V being replaced by other easier to compute quantities. However, it 
was evident early on that boys and girls differed both in their math and in their 
reading achievement test distributions as Gray and Stone reported in their 1917 
and 1908 doctoral dissertations, respectively [3, 4]. These differences largely hold 
independent of age or grade level or type of test. The main exception would seem to 
be that girls exceed boys in tests of elementary arithmetic. 

Remarkably, the early investigators tested large sample sizes which they could 
not hope to fully analyze given the tools available to them. The earliest U.S. 
achievement testing appears to have been initiated by J. M. Rice, a New York City 
educated physician who ended his practice in 1888, then studied psychology at 
universities in Leipzig and Jena, before turning his interest to achievement testing. 
He constructed achievement tests for both spelling and math. He reported in 1897 
results of testing 33,000 children on his spelling test [1]. He then turned to math 
achievement testing. Rice was said by Stone to have developed the earliest math 
achievement test in the U.S.A. [3, p. 95]. Rice constructed eight arithmetic word 
problems for each grade, fourth through eighth, then ultimately tested 5,903 children 
in 18 schools in seven cities [2]. Rice never addressed issues of sex differences in 
test performance. 

Around 1900, only paper and pencil calculation of any desired quantities was 
likely possible; not until 1902 was a printing calculator with addition and subtraction 
operations introduced. Only decades later would four operation machines be 
available [63]. Graphs were done manually as well and of course so was test 
scoring and evaluation. One result was that only a portion of the data collected were 
numerically summarized in some fashion. Because the sample mean and sample
standard deviation required algebraic calculations, they were often replaced by
easier to compute approximations. Thorndike wrote a book addressing the problem
[64]. A key quantity for Thorndike was the median, which requires essentially
only ordering data and counting. The median could replace the mean, and the
median formed the basis for other approximations. The Average Deviation, or
AD, is $AD = \sum |x - M|/n$, where x denotes a test score, n is sample size, and M
denoted the mean, or the median if the mean was not available. AD approximated the
standard deviation. The coefficient of variation, $s/\bar{x}$, the sample standard deviation
divided by the sample mean, could be replaced by AD/M. For Pearson's r, two
approximations were suggested which, in examples Thorndike provided, were much
larger than r, sometimes by 20% [64, p. 31].
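
A short sketch may help show how the Average Deviation relates to the standard deviation it stood in for. The function name `average_deviation` and the scores below are hypothetical, chosen only for illustration; nothing here reproduces Thorndike's own worked examples.

```python
import statistics


def average_deviation(scores, center="mean"):
    """Average Deviation: mean absolute deviation of scores about M.

    M is the mean by default, or the median when center="median".
    """
    if center == "mean":
        m = statistics.mean(scores)
    else:
        m = statistics.median(scores)
    return sum(abs(x - m) for x in scores) / len(scores)


# Hypothetical test scores, for illustration only.
scores = [12, 15, 9, 20, 14, 17, 11, 16, 13, 18]

ad = average_deviation(scores)
sd = statistics.stdev(scores)
mean = statistics.mean(scores)

print(f"AD = {ad:.2f}, s = {sd:.2f}")
print(f"AD/M = {ad / mean:.3f}, s/mean = {sd / mean:.3f}")
```

For normally distributed scores, the expected absolute deviation from the mean is $\sigma\sqrt{2/\pi}$, or about $0.80\sigma$, so AD runs systematically below the standard deviation. A ratio of two ADs, however, can still track the corresponding ratio of standard deviations reasonably well.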

One silver lining of these early computational difficulties may have been the 
tendency to report tabulated raw frequency data, sometimes with little comment 
other than what could be gleaned by a visual glance. Constructing tables of score 
frequencies would seem far easier than computing summary quantities by hand. 

Most early U.S. sex difference studies are summarized by Lincoln (1927) [65] in 
numerous tables. Table and page references in this section below are all to Lincoln, 
unless noted otherwise. Lincoln's book was his Harvard education doctorate, and 
according to him, it was the first attempt to gather together sex difference research 
findings on a myriad of variables. Two earlier reviews actually appeared [66, 67] 
which Lincoln does not reference. This is understandable, however, because the 
sources report no useful numerical data. 

There is clear evidence of mdo holding in the early literature. While V are 
reported for some math results, no V appear to have been reported for early reading 
results for reasons just noted. Already noted were Stone's [3] efforts constructing
two tests, which he called fundamental and reasoning, intended to assess sixth grade
children. Others used his tests more widely. Based on an analysis of the data he reports
[3, pp. 30-32], and as noted earlier, one test satisfies mo, and the other mdo. Stone's
test data will be analyzed in Chap. 5.

In his 1916 University of Chicago dissertation, Gray produced a series of oral 
and silent reading tests which were widely used early on (versions of which are 
used today). He was probably the first one to report sex differences in reading 
achievement. He provides no useful data for analysis, but his Diagram XIII [4, p. 
127] is reproduced here as Fig. 2.1. “The diagram shows that in all grades girls do
better than boys in oral reading” [4, p. 126].

Gray's tests were the basis for providing likely the first large scale assessment 
of reading achievement, in the 1918 Survey of the St. Louis Public Schools [68, 
69] a massive work of which a portion was devoted to reading, writing, arithmetic, 
and music assessments; only for the reading tests are there data on sex differences. 
The volumes were intended as supporting documents for a proposed school bond 
offering. 

Both the oral and silent reading tests developed by Gray were based on short 
paragraphs each child read, during which a child's time in seconds to read each 
paragraph was recorded, along with several types of reading errors. An oral reading 
score depended on the seconds taken to read each of twelve paragraphs together 
with the number of errors made. For silent reading, the length of time required to 
Fig. 2.1 Graph from Gray [4, p. 127] showing girls' mean exceeding boys' mean for Grades 1 to 8 on the Gray Oral reading test


read three short paragraphs determined a child's silent reading rate with answers 
provided to questions about the content read. The results are summarized in two 
tables in Lincoln which appeared earlier [69]. Table 35, based on 5,118 children, 
shows girls exceed boys in oral reading for Grades 1 to 8 at seven of eight grade 
levels. The silent reading test scores, Table 36, were based on 4,463 children 
with two performance indices, quality scores, and rate scores. For five grades, two 
through eight, boys exceed girls in all grades on quality, which presumably reflects 
the “... ability to master the thought of what is read [69, p. 170].” Using the reading 
rate score, or reading speed, girls exceed boys in four of the seven grade levels and 
are tied at one grade. The number of boys and girls at each grade in these tests was 
typically above 300. 

Lincoln reports reading scores by other investigators and other tests in four 
additional tables. In all cases, the performance variable favors girls, sometimes by 
wide margins. Table 37 reports a before and after experiment comparing reading 
rates. For six grade levels at the beginning, girls’ averages exceed boys’ averages in 
five grades. At the end, girls exceed boys in silent reading for four of six grades. Table
38 displays both rate and comprehension scores for six grades in rural Iowa. For both 
indices, girls exceed boys in five of six grade level comparisons. Table 39 reports 
quality scores at nine ages, 7 to 15; girls exceed boys in median scores at all ages. 
And when speed was the performance variable, girls are vastly superior, at eight of
nine ages, reported in Table 40. Concerning sex differences in reading variability, 
in three of four tables [65, pp. 149-154] with sex comparisons spanning ages 8 
to 15 years, boys' reading test standard deviations exceed the standard deviation for 
girls in both quality and reading speed in five or more of seven comparisons in each 
table. Lincoln concludes “The weight of the evidence seems to indicate that girls 
are somewhat superior to boys in reading [65, p. 72].” 

S. A. Courtis [70] constructed eight arithmetic tests of varying content that were
used in a New York City school testing program around 1910. These tests will be
considered in more detail later. Test 8, in Lincoln's Table 106, a reasoning test of
eight word problems, was administered to 13,629 boys and 13,542 girls (probably in
Grades 4 to 8); based on an analysis of these data, discussed further below, mdo
holds. Brooks [71] used Stone's reasoning test; mo holds for 5 of 7 ages, 9 to 15. The
results are in Table 34. Thorndike [72] constructed four arithmetic reasoning word
problems. With substantial help from Massachusetts school personnel, 4,640 children
were tested. Reported was the percentage of boys exceeding the girls' median.
In 22 of 24 comparisons, in an unnumbered table [65, p. 61] for grades 6 to 9,
boys' percentage exceeded 50%, which might be a proxy for $\bar{x}_b > \bar{x}_g$. No index
of variability is given. Interestingly, in a footnote, Lincoln reports that around 1917
Cyril Burt “... found the same difference in London schools” [65, p. 60]. In yet an
additional math reasoning test, spanning five grade levels, fourth through sixth, all
10 comparisons favored boys. The index was the percentage of boys exceeding the
girls' median [65, p. 59].
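
The index used in these early reports, the percentage of boys exceeding the girls' median, needs only sorting and counting. The following is a minimal sketch of it; the function name and the two small score lists are hypothetical and serve only to illustrate the computation.

```python
import statistics


def percent_above_other_median(group, reference):
    """Percentage of `group` scores strictly above the median of `reference`."""
    ref_median = statistics.median(reference)
    above = sum(1 for x in group if x > ref_median)
    return 100.0 * above / len(group)


# Hypothetical scores, for illustration only.
boys = [3, 5, 6, 7, 7, 8, 9, 10]
girls = [2, 4, 5, 6, 6, 7, 7, 8]

pct = percent_above_other_median(boys, girls)
print(f"{pct:.1f}% of boys exceed the girls' median")
print("boys' mean larger:", statistics.mean(boys) > statistics.mean(girls))
```

A value above 50% suggests, but does not guarantee, that the boys' mean exceeds the girls' mean: the index reflects only location relative to a median, and as with the medians themselves it carries no information about variability.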

From 1100 to 1200 New York City students of each sex were tested in each
of 18 grade levels from fourth to twelfth (e.g., 4A, 4B to 12A, 12B); results are
given for the two tests selected by Courtis which he thought revealed the greatest sex
difference. Seventeen of the eighteen test medians, Table 27, favor girls on Test 3,
a 120-item multiplication test of single-digit numbers (e.g., 3 × 4). Lincoln writes
“There appears a very clear superiority in favor of girls” [65, p. 57]. Turn the page
for Table 28: for Courtis Test 6, a 16-item reasoning test, all word problems, Lincoln
comments: “Here we find the differences largely in favor of boys...” [65, p. 58]; 14
of 18 grades favored boys. These tables and associated commentary provide clear 
evidence, in large sample data, that early on it was recognized that girls exceeded 
boys on measures of central tendency for certain arithmetic tests. 

In summary, the early literature does provide evidence of mdo and mo holding 
for arithmetic reasoning tests. It also demonstrates that very early in achievement 
testing's history, there was good evidence that girls exceed boys on certain 
arithmetic tests. Should V for reading have been reported, it appears that many 
studies would have satisfied ro and likely rdo as well. 

As a historical note, it seems important to recognize E. L. Thorndike's impact
on the early development of both math and reading achievement tests. From
1899 forward, Thorndike spent his entire career at Teachers College, Columbia
University. He was Stone's statistical and conceptual advisor on math achievement 
testing. Gray, a pioneer in reading tests, while earning his BA in 1913 and doctorate 
in 1916 from the University of Chicago, earned a master's at Columbia University 
in 1914, where he was influenced by Thorndike. Lincoln footnotes other Thorndike
students, for example, Brooks [71]. For a very different perspective on Thorndike 
and his early influence, see Shields [73]. 
Chapter 3
Varying Viewpoints on Sex Differences


Conceptual deficiencies, the surprising claim of no math test score sex differences, 
and other efforts to understand or explain math and other sex differences in task 
performance are of concern here. Those readers primarily interested in methods, 
models, and subsequent results may skip to Chap. 4 with little loss in continuity. 


3.1 Similarities and Hellinger Distances 


When Hyde in 2005 introduced her similarities hypothesis, which “...holds that
males and females are similar on most but not all psychological variables” [74, p.
581], she could have given the idea of similarities, an old psychological concept,
a rigorous definition and, at the same time, proposed another more appropriate
index of sex differences. But this did not happen. Although widely used then as
now, d was well recognized at that time to be a conceptually inadequate index of
sex differences, at least for some tasks. The need to report $s_b^2/s_g^2$ in addition to d
clearly implies that recognition, at least implicitly. If the population variances of 
the two sexes were plausibly regarded as the same, there would be no need to report 
variance ratios. Furthermore, much earlier Feingold in 1992 [28] presciently argued, 
as noted earlier, that to understand sex differences, both means and variances must 
be addressed. Feingold seemed to be calling for a new conceptual model for sex 
differences. Just how this goal could be achieved was apparently not obvious, as 
there appears to have been no proposal to do so in the subsequent decades. 

However, simply exchanging d and its model δ for an index that relaxes the
population variance equality constraint would have been an important conceptual 
improvement. It might also have changed the psychology of how sex differences 
are viewed, away from the simple and misleading notion that sex differences are 
primarily, if not just exclusively, mean differences. 


© The Author(s) 2024
H. Thomas, Sex Differences in Reading and Math Test Scores of Children,
Monographs in the Psychology of Education,
https://doi.org/10.1007/978-3-031-41272-1_3


Instead, the notion of similarities and effect size d were tightly tied. In addition,
there was no indication of just what variable attributes the similarities hypothesis
was intended to capture other than effect size differences. Yet in the language of
psychology, the notion of “psychological variable similarities" certainly extends 
well beyond the narrow effect size mean differences index [75]. The similarities 
hypothesis seems to have remained conceptually untethered to anything other than 
effect size and those often observed small values of |d| in various tasks. 

A more appropriate scalar index of sex differences that is easy to compute,
assuming the variables of both sexes follow normality, is the Hellinger distance
denoted as H [76]. H overcomes obvious deficiencies of d and, unlike δ, H is a
metric, that is, H is a measure of distance between a pair of probability distributions.
It does require that the distributions of concern be specified, which is not required for
δ. But distributional normality is implicitly assumed, seemingly always, in settings
where d is the focus, so this added assumption seems benign. H does not require
equal population variance for both sexes, an Achilles heel for δ and consequently d.
Under δ and thus d, two distributions are conceptualized as different by location-shift,
obviously conceptually wrong at least for sex differences in reading and math
test score distributions. H does not depend on how the distributions of boys' and
girls' test scores are conceptualized as different. H is bounded in the interval zero to
one, with zero indicating the distributions are identical and one indicating the distributions
are disjoint.

H and a function of H for use in what follows will be defined below in Chap. 4. 
As a scalar, estimates of H make it easy to compare the differences in, for example, 
boys’ and girls’ PISA math test score distributions for different countries, assuming 
V are available. Thus H can put the issue of U.S. test score sex differences into 
a broader global perspective. H could provide a unifying common sex differences 
metric that reflects variance differences even if variances were otherwise ignored. 
Any alternative index proposed is unlikely to replace d because d is too well 
ingrained in the fabric of research. But perhaps H can be reported along with d. 
H has a clear conceptual interpretation as a distance. d has no clear interpretation, 
at least for sex differences in reading and math test scores. 

One might observe that what Hyde seemed to have wanted, with her similarities 
hypothesis, was a metric or distance index on the separation of probability distributions 
of boys and girls on whatever the current variable of interest happened to be. 
She writes "Crucial to meta-analysis is the concept of effect size, which measures 
the magnitude of an effect—in this case the magnitude of gender difference [74, p. 
582, italics added].” The word “magnitude,” which Hyde used repeatedly, usually 
carries with it in science at least the notion of a distance or metric, a property that 
the index she was advocating lacked. To write “measures the magnitude” clearly 
suggests Hyde wanted a “yardstick,” which she did not have. 

In 2019, Hyde and others [15] cite a large-scale meta-analysis [9] and conclude 
“... girls had reached parity with boys in mathematics [15, p. 177].” They report a 
second meta-analysis [10] which “... accumulated data from 242 studies, representing 
the testing of more than 1.2 million people. Overall, d = 0.05, again indicating 
no gender difference [15, p. 177].” 

While these statements sound impressive and may seem initially compelling, 
sex differences in math achievement test score distributions have been recognized 
and well documented for more than 100 years [3, 70, 71]. Furthermore, data 
demonstrating these differences have long been available, as detailed in Chap. 2. 
Consequently, the claim that there are no gender differences in math test scores 
seems preposterous. 

However, statements that boys and girls are “equal” in math performance [25, p. 
377], that there is “no gender difference,” that the differences are “trivial or 
nonexistent,” or that “parity” has been achieved [15, p. 177], along with similar 
characterizations, have appeared in publications spanning well over a decade 
[9, 10, 15, 25, 32, 33]. Consequently, it is interesting to understand the basis for 
this remarkable “no gender difference” or “parity” claim in math testing, made by 
varying subsets of a set of 13 different researchers authoring the publications just 
referenced. 

First, consider what a “no difference” claim actually means. The precise intended 
meaning of the claim seems unclear because it appears never to have been 
explicitly defined. However, their argument rests on average d values from large- 
scale meta-analyses. Thus the claim is properly seen as a conditional claim because 
d addresses only mean differences in distributions, ignoring variance differences. 
But no conditions are placed on their summary statements. For example, “Our 
analysis shows that, for grades 2 to 11, the general population no longer shows a 
gender difference in math skills, consistent with the gender similarities hypothesis 
[9, p. 495].” This statement makes clear their claim is unconditional. 

The only plausible approach to understanding is that the “parity” or “no gender 
differences” claims are to be taken as equivalent to the hypothesis that boys 
and girls share the same math achievement test score distributions. With this 
equivalency, the claims are easily dispatched without statistical concerns, using data 
they conveniently provide. Under the identical distribution hypothesis, as already 
noted, the boys-to-girls variance ratio $s_b^2/s_g^2$ should, over studies, fluctuate 
about one, while d should, over studies, fluctuate about zero. As discussed earlier 
in Chap. 1, Table S1 [31] lists 66 variance ratios; 65 are greater than one. Under the 
parity hypothesis, about 33 would be expected to be so. Table 7 [10, p. 1132], which 
reports NAEP data (and is similar to Table 2.1 here), lists 36 variance ratios, all of 
which are greater than one; only 18 would be expected. The same table lists 36 d 
values; 34 are positive. Only 18 positive values would be expected. 
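
Although the argument above needs no formal statistics, the counts can be put on a probability scale with a simple sign-test calculation. The sketch below is an illustration added here, not part of the cited analyses. It assumes the studies are independent and that, under parity, each ratio or d value falls on either side of its null value with probability one half. Independence is unlikely to hold exactly, since the same states and tests recur across grades, so the resulting probabilities are only rough guides. The function name `upper_tail` is hypothetical.

```python
from math import comb


def upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Counts quoted in the text: (number of values, number on the "boys larger" side).
cases = {
    "Table S1 variance ratios above one": (66, 65),   # exact tail: 67 / 2**66
    "Table 7 variance ratios above one": (36, 36),    # exact tail: 1 / 2**36
    "Table 7 positive d values": (36, 34),            # exact tail: 667 / 2**36
}

for label, (n, k) in cases.items():
    print(f"{label}: {k} of {n}, P(X >= {k}) = {upper_tail(n, k):.3g}")
```

Under these assumptions every tail probability is many orders of magnitude below any conventional threshold, which is the quantitative counterpart of the statement that only about half of the values would be expected on each side.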

There is no surprise that this analysis shows the no difference hypothesis is false. 
Below their Table 7, the authors do state that “Overall, we conclude that a 
small gender difference favoring boys in complex problem solving is still present in 
high school [10, p. 1132].” But that is not the message the authors wish to convey. In 
their abstract, they state “Overall, d = 0.05 indicating no gender difference ... [10, 
p. 1132].” 

The puzzle is how these authors justify a no sex differences in math test scores 
claim when they must be aware of where such an elementary and obvious analysis 
leads. The answer appears easily given. The first and most important fact is that 
they ignore data consistencies in their reasoning. The foremost feature of data to 
recognize in addressing psychology's crisis, according to Kagan [34], is patterns 
in the data. The remarkably consistent outcome patterns of d > 0 and $s_b^2/s_g^2 > 1$ 
are clearly the most striking feature of their Table 7 [10, p. 1132] and immediately 
noticeable by anyone. Furthermore, this consistency is arguably the most striking 
feature of their article. This being so, it apparently does not matter how many 
small positive d there are or how many variance ratios greater than one there are; 
for some readers and some of these authors, the no difference or parity 
claim is unshakeable. 

The second fact is that often no standard errors or other indices of uncertainty 
are given for their summary quantities, typically values of d. They simply report the 
statistic of interest and assert their belief. It needs to be acknowledged, however, that 
the construction of standard errors for meta-analysis may not be as straightforward 
as commonly believed. What are claimed as appropriate standard errors are often 
wrongly constructed. Shuster's research on these matters is convincing [77]. 

Set aside the state test data-based claim that “The weighted mean is 0.0065, 
consistent with no gender difference [9, p. 495]; [31, p. 8].” The fact that 65 of 
66 $s_b^2/s_g^2$ exceed one, in huge sample sizes, decisively falsifies any parity claim 
for these state math test data. This outcome consistency seems remarkable when 
it is recognized that how the different tests differ in math content is unknown, 
that the grade levels ranged from grade 2 to 11, and that likely there were wide 
ranges of differences in test administration, scoring, and recording procedures 
employed among the different states. Such consistency simply rarely occurs in the 
psychological literature. But given the fact that mdo widely holds, it is perhaps 
not surprising that a rejection of the parity idea for the state test data comes from 
variances, not means. It seems reasonable to suspect that the near-zero weighted d 
is because of the likely variability of the content of the state tests, keeping in mind 
that boys' advantage on math tests seems related to a test's reasoning component. 

Similar subjectively judged decisions appear elsewhere. Referring to an earlier 
posting of NAEP data, similar to Table 2.3, Hyde writes “For these items, at grade 
12, the average effect size was d = 0.07, indicating that girls had reached parity 
with boys even for complex problem solving at the high school level [25, p. 381].” 
And earlier in the same source appears “... girls’ math performance is equal to that 
of boys ... [25, p. 377].” Hyde reminds the reader she defined d < 0.10 as trivial 
[25, p. 379]. 


It is interesting to note that for the 31 math V, in Tables 2.1 and 2.3, all but 
one d are, by Hyde's definition, trivial. What is one expected to conclude? The 
answer apparently desired is that the remarkably consistent sex differences in the 
NAEP data gathered over decades, revealing small mean differences favoring boys, 
and at three grade levels, using the most sophisticated protocol in U.S. history, and 
involving millions of children, are to be dismissed as substantively trivial. 

In summary, U.S. large sample representative data concerning sex differences in 
math testing show that for math tests with a reasoning component, the mean for boys 
is nearly always modestly larger than the mean for girls, and the variance for boys 
is nearly always far larger than the variance for girls. There appears to be no evidence 
this conclusion is inappropriate for nearly all international settings as well, at least 
for developed countries, as Table 2.5 reveals. 


3.3 Sex Differences and Searching for Answers 


Other researchers clearly recognize there exist, and have existed, sex differences 
in math testing, as well as reading. A key question is why such differences occur. 
Efforts to find definitive answers in test score data have largely failed. The most 
recent efforts have focused on math testing. As an example, Italian children's 
PISA math gap, which is the mean sex difference favoring boys, has been among 
the largest in Europe. To understand why, the 2013 Italian national assessment 
examination, the INVALSI, for school years 2, 5, 6, 8, and 10 was examined 
[61]. The “sample” was the entire population of school children's scores, more 
than 125,859 tests. Employing what they term as dynamic regression models, the 
effort yielded no definitive answers, but they conclude “... girls systematically 
underperform boys, even after controlling for an array of individual and family 
background characteristics, and that the average gap increases with children's age 
[61, p. 1].” In a rewrite, they suggest stereotypes play a major role in both reading 
and math test score sex differences. Although understanding the PISA math mean 
sex difference was the motivation for their study, they write in closing: "The analysis 
of the reasons why the gender gap in math exists and how it can be reduced is beyond 
the scope of our contribution [78, p. 39].” 

Stereotypes regarding girls and women have long been viewed as important 
variables for degrading girls' and women's test performance and thus explaining 
sex differences in math performance, as the previous [78] study illustrates. This 
perspective is a recurring theme in the math sex differences literature [79]. 
Stereotype threat has been well researched and has been shown, in some studies, 
to deleteriously affect girls’ math test scores [80]. Boys’ reading scores can 
be similarly influenced [81]. However, after assessing 15 years of evidence, it 
was concluded stereotype threat is unlikely to importantly influence girls’ math 
performance [82]. Even if the results had turned out differently and stereotype threat 
were shown to be a viable and important variable, because variance differences 
were ignored, the results would be unable to address the core concern here: how 
the inequalities noted at the outset, in S, originate and why they persist. 

Six variables are said to mediate sex differences in math [83]: 1-stereotype 
threat, 2-ostracism and gender identification, 3-self-sufficiency, 4-teacher and parent 
factors, 5-math-related experiences, and 6-math test anxiety. Evidence is cited for 
each. Collectively, it is argued “It has simply never been established that there is 
any meaningful and substantial sex difference in mathematics ability that is not 
massively confounded with factors related to individual experience [83, p. 42].” 
If true, these facts would seem to only magnify the importance of understanding 
the inequalities of focus here because they have managed to "poke through" the 
massive confounding claimed and nearly universally appear in the summary sample 
statistics. It is difficult, however, to see the relevance of such research on sex 
differences of primary concern here. That is because, again, of the failure to consider 
what are, after all, the largest sex differences, the variance differences. 

A multiyear Herculean effort reviewing more than four hundred publications 
culminated in Ceci, Williams, and Barnett [84, 85] and a corresponding lower- 
key book [5]. Their explicit, well-defined, and important main target goal was to 
explain the observed frequency participation differences among men and women in 
scientific and math-related occupations. This is not an explicit concern addressed 
here, but many issues they address overlap with issues of focus here. Yet at the 
end, it was not possible to find any explicit statement rendered by these authors 
regarding why such frequency differences between the sexes exist, and persist, and 
at different rates in different countries often with differences in languages and in 
different cultures. 

There is a literature, to be noted below, which specifically addresses similar 
frequency differences in success rates observed in boys’ and girls’ performances 
on certain cognitive tasks. Such findings, if they had been considered, could have 
been suggestive of corresponding sex differences in adults. Consequently, a different 
conclusion might have been provided than the one expressed: “... we believe that 
the evidence...points to nonbiological/ability factors as the major causes of the 
underrepresentation of women in mathematically intensive careers [5, p. xii; italics 
in original].” 

The strategy here, as noted earlier, is to explicitly focus on the specific set of 
empirical conditions of S, as the target for theoretical explanation. This narrow 
strategy may ultimately lead to greater overall understanding of sex differences than 
framing a far larger more encompassing set of goals. 


3.4 Genetics: The Oldest Recognized Sex Differences Influence 


Genetical X-linked influences have been recognized since the Talmud [86]. Furthermore, 
their trait influence is thought to be stable, with change occurring very 
slowly over long periods of time [87]. It is notable that interest in sex differences 
has probably always been triggered by frequency disparities in task performance, 
often most easily recognized in observational settings. 

O'Connor in 1943 [88] appears to have been the first to propose that X-linked 
gene influences could explain behavioral sex differences in performance 
unexplainable within alternative frameworks. Boys outperformed girls on his “wiggly 
blocks” task. In brief, given an X-linked gene with two alleles, if the recessive form 
has relative frequency q and is performance enhancing, then the proportion of boys 
expressing it, q, should exceed the corresponding proportion of girls, $q^2$. Or 
alternatively, the mean task performance for the boys should exceed the mean task 
performance for girls. 

However, not until the 1960 to 1980 interval were there efforts to address 
O'Connor's suggestion. The earliest approaches involved familial correlations. 
While a correlational approach cannot address the inequalities of focus here, this 
early history appears to have continued to shape perspectives today and in unhelpful 
ways. 

Under a bivariate discrete binary outcome X-linked Mendelian model (each 
marginal distribution is Bernoulli, with 0 or 1 outcome), the expected correlations 
among pairs of related individuals, for example, mother-daughter, or sister-brother, 
can be specified, given only the gene frequency. However, measurements on variables 
possibly mediated by X-linked influences are typically observations on outcome 
variables such as achievement test scores typically regarded as continuously 
distributed. It was widely assumed, wrongly, that continuous bivariate test score 
correlations could serve as a proxy for estimating the desired discrete Mendelian 
correlations. The general failure of the correlations computed on continuous data 
to yield what was hoped to be estimates of the Mendelian correlations, provoked 
strong rebukes concerning even the possibility of X-linked effects. For example, 
"the validity of the hypothesis is unfounded [89]," or “Sex differences...Not an 
X-linked effect [90]." 

What went unrecognized was a conceptual problem: there was no basis for 
assuming a discrete bivariate Mendelian correlation could be estimated from 
correlations obtained from a continuous bivariate test score model. Indeed, it was 
shown that attempting to construct a conceptual basis for doing so was essentially 
analytically impossible or at least intractable [91]. 

This result implied these early correlational approaches were 
misguided and conceptually flawed. Nevertheless, the general failure of a correlation 
approach, along with the strongly worded, largely condemnatory spirit of the attacks 
appears to have carried over to present times. Indeed, the very idea that one might 
believe any trait might be attributable to a single-gene effect signals to some writers, 
even today, that such persons are beyond redemption and also are apparently stupid: 
“An education in psychology is not sufficient to overcome the appeal of single 
genes [92, p. 495].” It may well be these authors wish to reconsider their belief. 
A single gene TKTL1 may well be responsible for distinguishing cognitive abilities 
of modern humans from Neanderthals [93]. 

Besides the fact that genetical arguments are often, understandably, psychologically 
unappealing, there was another issue. Many X-linked recessive traits 
are deleterious, and some lethal, and suggesting a recessive gene might enhance 
performance seemed implausible, especially given the role of dominance in the 
theory of natural selection [94, 95]. This perspective has been revised, and today 
the role of recessive genes as having beneficial effects is an active area of research 
[96]. 

The acknowledgment of X-linkage as a possible factor explaining sex differences 
remains largely absent in the contemporary psychological sex differences literature 
concerning reading and math, although not the wider literature addressing sex 
differences more generally. The possibility of other forms of genetical influence, 
namely polygene effects playing a role in observed sex differences, seems more 
widely acknowledged, in the psychological literature, although rarely explicitly 
embraced, at least in the recent psychological sex differences literature. 

An important notable earlier exception to this recent trend is the perspective 
of Eleanor Maccoby and Carol Jacklin [53]. In their time, they were arguably the 
most scholarly, the most influential, authoritative, and unbiased voices on matters 
of sex differences. They argued, far more broadly than is being argued here, that 
the origins of psychological sex differences were determined by “...certain 
sex-linked biological predispositions, but to say so is not to deny the importance of 
social learning [53, p. 275].” Their entire book is framed within a biological 
perspective. Consider their book-ending section title: “Is Biology Destiny? [53, p. 
373].” They clearly embraced the likelihood of biological factors playing a role by 
setting boundary conditions: “A variety of social institutions are viable within the 
framework set by biology [53, p. 374].” And they addressed explicitly, and in some 
detail, the possibility of recessive X-linkage as a possible sex differences factor in 
spatial task settings [53, pp. 121–122; 361]. 

Part of the reluctance to embrace genetical influences is understandable: one 
reason is there is often missing any explicit convincing mechanistic connection 
between general statements of genetical influences on sex differences, and the 
observed sex differences that triggered the interest in the first place, such as the 
observed relative frequency differences arising on certain task performances, or 
the obvious lopsided participation rates among men and women in certain math- 
oriented occupations. A coherent conceptual framework is needed which accounts 
for these observational differences and which also accounts for the summary 
statistical inequalities. 

If math talent is facilitated by an X-linked recessive gene, all sons of math 
talented women should be math talented. This prediction was examined for the 
children of twelve Israeli women university mathematicians [97]. Math talent was 
defined as an SAT math score greater than 700. Among the ten women who agreed to 
participate, all six boys were talented. Of ten girls, one was expected to be 
talented, but none were. The hypothesis that each sex is equally likely to be 
talented is easily rejected, p < 0.0004. While small, the study [97] is remarkable. 
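
The text does not say which test produced the reported p value, so the following is only one plausible way to gauge the result, not a reconstruction of the analysis in [97]. The sketch treats the outcome as a 2 × 2 table (sex by talented or not) with fixed margins and computes the hypergeometric probability of the observed table, as in Fisher's exact test. Because six talented boys out of six is the most extreme table possible with these margins, that single probability is also the one-sided exact p value. The function name `table_probability` is hypothetical.

```python
from math import comb


def table_probability(boys_talented: int, boys_total: int,
                      girls_talented: int, girls_total: int) -> float:
    """Hypergeometric probability of a 2 x 2 table with both margins fixed."""
    talented = boys_talented + girls_talented
    total = boys_total + girls_total
    ways_observed = comb(boys_total, boys_talented) * comb(girls_total, girls_talented)
    return ways_observed / comb(total, talented)


# Counts quoted in the text: all 6 boys talented, none of the 10 girls talented.
p_exact = table_probability(6, 6, 0, 10)
print(f"exact probability = 1/{comb(16, 6)} = {p_exact:.6f}")
```

Under this assumed test the probability is 1 in 8008, which is consistent with the bound p < 0.0004 quoted above, although the original study may have used a different procedure.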

With varying methodological approaches, X-linked explanations have accounted 
for observed sex differences in Piaget's water-level task [98, 99], the mental rotation 
task [100], variance differences in intelligence [101], but see [102]; in addition, 
there have been explanations for sex differences in performance on Witkin's rod 
and frame task [103] and gifted students’ math test performance [104]. 

More recently, there has been a wide recognition of the plausibility that X-linked 
genetical factors can contribute to phenotypical behavioral differences between boys 
and girls [87, 105–108]. In addition, the documented importance of X-linkage for 
understanding trait distribution continues to grow [109]. 


Chapter 4 
Genetical and Y Models for Math and Reading 


This chapter details first the genetical model and then the probability model Y for 
math and reading. While the formality of the approach is distinctive, crucial to the 
endeavor is just how sex differences are construed. Rather than focusing on between 
sex differences, the focus is on within sex differences. This is the key to conceptual 
understanding. 

It is this focus on within sex differences, not between sex differences, which 
appears to set the approach apart from all other efforts to explain sex differences. 
Viewing matters from an effect size perspective is simply putting on a conceptual 
blindfold. Once these within sex differences are suitably modeled, the framework 
for understanding the between sex differences follows nearly trivially. 

Only the model specifications and results of the analyses appear below. The 
arguments appear in Appendix A. The most important analytical results state that 
the inequality orderings on the parameters of Y correspond with the empirical 
inequalities of mo and mdo in the case of math and ro and rdo in the case of 
reading. Consequently, the elements of S are simply expectations under Y. 

Two “toy” examples, intended to aid intuition and illustrate the model’s usefulness 
in both generating data and estimating parameters, appear at the start of 
Chap. 5. The moment estimation procedure is detailed in Appendix A.3, and an 
R code implementation [110] is given in Appendix A.4. 


4.1 The Genetical Model 


“X-linked genes, especially escape genes...contribute to sex differences [105, p. 
241].” X-linkage, also referred to as sex-linkage, is applied here to math and 
reading test score sex differences in boys and girls. Recall boys have X and Y 
chromosomes, Y inherited from their father and X from their mother. Girls have 
two X chromosomes, one from their father and one from their mother. For girls, 
the genes of one of these X chromosomes are mostly inactivated, but some genes 
escape this inactivation. Considering biallelic genes that escape inactivation, girls 
have genotypes AA, Aa, aA, and aa, while boys with a single X have genotypes 
A or a, with capital A denoting the dominant and lower case a the recessive 
allele. Assume for girls AA, Aa, and aA all lead to the same phenotypic test 
performance as does A for boys. Assume a boys and aa girls display 
identical test score performance. Assume A and a have relative frequencies 
$\Pr(A) = p \in (0, 1)$, $\Pr(a) = q \in (0, 1)$, and $p + q = 1$. Assume girls’ four 
genotypes AA, aA, Aa, and aa follow a binomial with $(p + q)^2 = p^2 + 2pq + q^2$, 
with $\Pr(aa) = q^2$. The pairs of probabilities of focus are, for girls, $q^2$ and 
$1 - q^2$ and, for boys, q and $1 - q$. 

In math, a and aa with probabilities q and $q^2$, respectively, for boys and girls, 
are assumed to facilitate high math performance. They become coefficients of latent 
probability distributions in the probability model to explain sex differences in high 
math test score performance. For reading, genotypes with probabilities $1 - q$ 
and $1 - q^2$ are assumed to facilitate high reading performance for boys and girls, 
respectively, and these too become coefficients of latent reading distributions. Of 
course, different genes are assumed to be involved for each task. 
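
The genotype bookkeeping above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the values of q are arbitrary choices, not estimates from any data set, and the function names are hypothetical. It tabulates the girls' four genotype probabilities from the binomial expansion and compares the fraction of boys (q) and girls ($q^2$) carrying only the recessive allele.

```python
def genotype_probabilities(q: float) -> dict:
    """Girls' genotype probabilities from (p + q)^2 with p = 1 - q."""
    p = 1.0 - q
    return {"AA": p * p, "Aa": p * q, "aA": q * p, "aa": q * q}


def recessive_fractions(q: float) -> dict:
    """Fraction expressing the recessive phenotype: boys q, girls q^2."""
    return {"boys": q, "girls": q * q}


for q in (0.1, 0.3, 0.5):  # illustrative allele frequencies, not estimates
    genotypes = genotype_probabilities(q)
    fractions = recessive_fractions(q)
    ratio = fractions["boys"] / fractions["girls"]  # algebraically 1 / q
    print(f"q={q}: Pr(aa)={genotypes['aa']:.4f}, "
          f"boys={fractions['boys']:.4f}, girls={fractions['girls']:.4f}, "
          f"boy:girl ratio={ratio:.2f}")
```

The boy-to-girl ratio of recessive phenotypes equals 1/q, so the rarer the allele, the more lopsided the frequencies become, even though the same gene is at work in both sexes.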


4.2 Model Y for Math 


The fundamental idea is that there are, for boys’ and girls’ math and reading test 
scores, two latent unobserved subpopulations within each sex. The shapes of these 
two subpopulations define the shape of each sex's population, that is, each sex's test 
score distribution. One subpopulation is composed of high scoring individuals, and 
the other subpopulation is composed of low scoring individuals. It is assumed all 
individuals are members of one, and only one, of these subpopulations. Assume for 
the moment that all high scoring individuals, both boys and girls, have outcome 
$\mu_2$, while all low scoring individuals have outcome $\mu_1$, with $\mu_1 < \mu_2$. 
$\mu_1$ and $\mu_2$ are outcomes of a binary two-outcome “Bernoulli-like” random 
variable B: $B_b$ for boys and $B_g$ for girls. While $B_b$ and $B_g$ have identical 
outcomes, their outcome probabilities are different. 

Now focus on math. For boys, $P(B_b = \mu_2) = q$ and $P(B_b = \mu_1) = 1 - q$. 
For girls, $P(B_g = \mu_2) = q^2$ and $P(B_g = \mu_1) = 1 - q^2$. 
So the probability of the B outcomes reflects the gene frequencies just given above, 
with the recessive gene facilitating higher test scores for math. 

Of course, the outcomes of test scores are not discrete binary outcomes $\mu_1$ or 
$\mu_2$. Rather they are usually regarded as continuous. Thus, another random variable 
is required, N: it is assumed to be independent of B, and it has mean zero and 
variance $\sigma^2 > 0$, and like $B_b$ and $B_g$, there is a pair, $N_b$ for boys and 
$N_g$ for girls. $N_b$ and $N_g$ have identical but unspecified distributions and 
reflect other sources of variance influencing test scores. 

So far, all the above theory is latent and unobserved. What is observed is the 
additive composition of N and B. The result of this addition is that $\mu_1$ and 
$\mu_2$ become the means of two latent probability distributions which represent the 
two subpopulations. 

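Define Y by Y = B + N, where a realization of Y, that is, y, is a math test score, 
and where 


$$
Y = B + N =
\begin{cases}
Y_b = B_b + N_b & \text{math for boys,} \\
Y_g = B_g + N_g & \text{math for girls.}
\end{cases}
\tag{4.1}
$$


With E(·) denoting expectation and var(·) denoting variance, 


$$
E(Y_b) = \mu_b = (1 - q)\mu_1 + q\mu_2
\quad \text{and} \quad
E(Y_g) = \mu_g = (1 - q^2)\mu_1 + q^2\mu_2,
$$

$$
\mathrm{var}(Y_b) = \sigma_b^2 = q(1 - q)(\mu_2 - \mu_1)^2 + \sigma^2;
\quad
\mathrm{var}(Y_g) = \sigma_g^2 = q^2(1 - q^2)(\mu_2 - \mu_1)^2 + \sigma^2.
$$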

Only realizations of $Y_b$ and $Y_g$ are observed. Importantly, the only sex differences 
are the discrete outcome probabilities for $B_b$ and $B_g$ given above. The distributions 
of $Y_b$ and $Y_g$, along with three inequalities, may be given. 
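
Before turning to the distributions, the mean and variance expressions above can be checked by simulation. The sketch below is a minimal illustration under stated assumptions: the noise variable N is taken to be normal (the model leaves its distribution unspecified), the parameter values are arbitrary rather than estimated, and the function names are hypothetical. It draws scores as Y = B + N for each sex and compares sample moments with the formulas.

```python
import random
from statistics import fmean, pvariance


def model_moments(q, mu1, mu2, sigma2):
    """Return ((mean_boys, var_boys), (mean_girls, var_girls)) for math under Y."""
    gap = mu2 - mu1
    q2 = q * q
    boys = ((1 - q) * mu1 + q * mu2, q * (1 - q) * gap**2 + sigma2)
    girls = ((1 - q2) * mu1 + q2 * mu2, q2 * (1 - q2) * gap**2 + sigma2)
    return boys, girls


def simulate(prob_high, mu1, mu2, sigma2, n, rng):
    """Draw n scores Y = B + N with P(B = mu2) = prob_high; N is normal by assumption."""
    sd = sigma2**0.5
    return [(mu2 if rng.random() < prob_high else mu1) + rng.gauss(0.0, sd)
            for _ in range(n)]


q, mu1, mu2, sigma2 = 0.3, 0.0, 2.0, 1.0   # illustrative values only
rng = random.Random(1)                      # fixed seed for repeatability
boys = simulate(q, mu1, mu2, sigma2, 100_000, rng)        # boys: P(high) = q
girls = simulate(q * q, mu1, mu2, sigma2, 100_000, rng)   # girls: P(high) = q^2

(mb, vb), (mg, vg) = model_moments(q, mu1, mu2, sigma2)
print(f"boys : theory mean={mb:.4f} var={vb:.4f}; "
      f"sample mean={fmean(boys):.4f} var={pvariance(boys):.4f}")
print(f"girls: theory mean={mg:.4f} var={vg:.4f}; "
      f"sample mean={fmean(girls):.4f} var={pvariance(girls):.4f}")
```

With these illustrative values the boys' mean exceeds the girls' mean by a modest amount while the variance gap is larger, which is the qualitative pattern the model is meant to reproduce.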

The $Y_b$ and $Y_g$ distributions are 


$$
f_b(y) = (1 - q) f_1(y) + q f_2(y) \quad \text{the distribution for boys}
$$


and 


$$
f_g(y) = (1 - q^2) f_1(y) + q^2 f_2(y) \quad \text{the distribution for girls.}
$$


Here $f_k(y) = f(y; \mu_k, \sigma^2)$, k = 1, 2, and f is the distribution of $N_b$ and 
$N_g$; f can be either continuous or discrete and is otherwise unspecified. Please see 
Appendix A.1 for the argument. 

The distribution of $Y_b$ is not the same as the distribution of $Y_g$ plus a constant, 
provided 0 < q < 1. Thus, Y is not a location-shift model. The right sides of $f_b(y)$ 
and $f_g(y)$ are latent and unobserved. But their parameters q, $\mu_1$, $\mu_2$, and 
$\sigma^2$ can be estimated, as will be illustrated in the next chapter. 
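
To see why four summary statistics suffice to recover the four parameters, the moment expressions given earlier can be inverted directly. The sketch below is not the Appendix A.3 procedure, which is not reproduced in this chapter; it is a hypothetical moment-matching illustration derived only from the formulas for $\mu_b$, $\mu_g$, $\sigma_b^2$, and $\sigma_g^2$. Subtracting them gives $\mu_b - \mu_g = q(1 - q)(\mu_2 - \mu_1)$ and $\sigma_b^2 - \sigma_g^2 = q(1 - q)(1 - q - q^2)(\mu_2 - \mu_1)^2$, so the ratio $(\sigma_b^2 - \sigma_g^2)/(\mu_b - \mu_g)^2$ equals $(1 - q - q^2)/(q(1 - q))$, a function that decreases in q on (0, 0.618) and can be solved by bisection.

```python
def estimate_math_parameters(mean_b, var_b, mean_g, var_g, tol=1e-12):
    """Moment-matching sketch: return (q, mu1, mu2, sigma2) from the four moments."""
    mean_gap = mean_b - mean_g
    var_gap = var_b - var_g
    if mean_gap <= 0 or var_gap <= 0:
        raise ValueError("requires mean_b > mean_g and var_b > var_g")

    target = var_gap / mean_gap**2

    def ratio(q):
        # (1 - q - q^2) / (q (1 - q)): decreasing from +inf to 0 on (0, 0.618...)
        return (1 - q - q * q) / (q * (1 - q))

    lo, hi = 1e-9, (5**0.5 - 1) / 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if ratio(mid) > target:
            lo = mid   # ratio still too large, so q must be larger
        else:
            hi = mid
    q = (lo + hi) / 2

    gap = mean_gap / (q * (1 - q))             # mu2 - mu1
    mu1 = mean_b - q * gap
    mu2 = mu1 + gap
    sigma2 = var_b - q * (1 - q) * gap**2      # may be negative for noisy inputs
    return q, mu1, mu2, sigma2


# Round trip with illustrative values q=0.3, mu1=0, mu2=2, sigma2=1.
print(estimate_math_parameters(0.6, 1.84, 0.18, 1.3276))
```

With sample statistics in place of exact moments, sampling error can push the implied σ² below zero or leave no admissible q, so any practical estimator needs safeguards that this sketch omits.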

The corresponding lower tail distribution functions $F_b(y)$ for boys and $F_g(y)$ for 
girls are (if y is continuous) 


$$
F_b(y) = (1 - q)\int_{-\infty}^{y} f_1(w)\,dw + q\int_{-\infty}^{y} f_2(w)\,dw,
$$

$$
F_g(y) = (1 - q^2)\int_{-\infty}^{y} f_1(w)\,dw + q^2\int_{-\infty}^{y} f_2(w)\,dw.
$$


Again, $f_k(y) = f(y; \mu_k, \sigma^2)$, k = 1, 2, and f is the distribution of $N_b$ and $N_g$. 

The population analogs of the empirical inequalities mo and mdo are given in 
the following three inequalities: 


$$
E(Y_b) = \mu_b > \mu_g = E(Y_g) \tag{4.2}
$$

$$
\mathrm{var}(Y_b) = \sigma_b^2 > \sigma_g^2 = \mathrm{var}(Y_g)
\ \text{and}\ \sigma_b > \sigma_g \ \text{if}\ 0 < q < .618 \tag{4.3}
$$

$$
\sigma_b^2 - \sigma_g^2 > \mu_b - \mu_g
\ \text{given that (4.2), (4.3) and}\ L\ \text{hold.} \tag{4.4}
$$


$L = \{\mu_d = \mu_2 - \mu_1 > 1 \ \&\ 0 < q < ([5 - 4/\mu_d]^{1/2} - 1)/2 < .618\}$. L is a weak 
condition, requiring only that $\mu_2 - \mu_1 > 1$ and, like (4.3), fixes an upper bound on 
the size of q. Please see Appendix A.2 for the arguments. 
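
The three inequalities and the bound in L can be checked numerically from the moment formulas. The sketch below is an illustration with arbitrary parameter values and hypothetical function names. The bound .618 is $(\sqrt{5} - 1)/2$, the positive root of $q^2 + q - 1 = 0$, and the bound in L follows from requiring $(1 - q - q^2)(\mu_2 - \mu_1) > 1$, which is what (4.4) reduces to after dividing both sides by $q(1 - q)(\mu_2 - \mu_1)$.

```python
from math import sqrt

VARIANCE_BOUND = (sqrt(5) - 1) / 2   # about 0.618, root of q^2 + q - 1 = 0


def q_upper_bound(mu_d: float) -> float:
    """Upper bound on q in condition L, for mu_d = mu2 - mu1 > 1."""
    if mu_d <= 1:
        raise ValueError("condition L requires mu2 - mu1 > 1")
    return (sqrt(5 - 4 / mu_d) - 1) / 2


def math_inequalities(q, mu1, mu2, sigma2):
    """Evaluate inequalities (4.2), (4.3), and (4.4) from the moment formulas."""
    gap = mu2 - mu1
    q2 = q * q
    mean_b = (1 - q) * mu1 + q * mu2
    mean_g = (1 - q2) * mu1 + q2 * mu2
    var_b = q * (1 - q) * gap**2 + sigma2
    var_g = q2 * (1 - q2) * gap**2 + sigma2
    return {
        "4.2": mean_b > mean_g,
        "4.3": var_b > var_g,
        "4.4": var_b - var_g > mean_b - mean_g,
    }


mu1, mu2, sigma2 = 0.0, 2.0, 1.0            # illustrative values only
print("bound on q in L:", round(q_upper_bound(mu2 - mu1), 4))
for q in (0.3, 0.4, 0.7):
    print(q, math_inequalities(q, mu1, mu2, sigma2))
```

With $\mu_2 - \mu_1 = 2$ the bound in L is $(\sqrt{3} - 1)/2$, so q = 0.3 satisfies all three inequalities, q = 0.4 satisfies (4.2) and (4.3) but not (4.4), and q = 0.7 violates (4.3) as well.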

Inequalities (4.2), (4.3), and (4.4) reveal Y is easily falsifiable. Note that (4.4) 
requires that (4.2), (4.3), and L hold. Thus, failures in data of mo or mdo to hold 
might be taken as falsifying Y, although sampling variability must be considered. 
Clearly, by inspection, $E(Y_b)$ and $\mathrm{var}(Y_b)$ share common parameters and are 
functionally related, and similarly for girls. 

The key point in the above development is that Y generates the same analytical 
inequality orderings, in the model, inequalities (4.2), (4.3), and (4.4), as are observed 
in data, namely mo and mdo. Thus, under Y, mo and mdo are expected outcomes. 
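
The lower tail distribution functions $F_b$ and $F_g$ given earlier in this section also imply how unequal the sexes' frequencies become in the upper tail, the kind of frequency disparity that typically prompts interest in sex differences. The sketch below is an illustration under an added assumption: the component distribution f is taken to be normal, which the model itself does not require. Parameter values are arbitrary and the function name is hypothetical.

```python
from statistics import NormalDist


def lower_tail(y, prob_high, mu1, mu2, sigma):
    """F(y) = (1 - prob_high) F1(y) + prob_high F2(y), with normal components assumed."""
    low = NormalDist(mu1, sigma).cdf(y)
    high = NormalDist(mu2, sigma).cdf(y)
    return (1 - prob_high) * low + prob_high * high


q, mu1, mu2, sigma = 0.3, 0.0, 2.0, 1.0   # illustrative values only
for y in (1.0, 2.0, 3.0, 4.0):
    upper_boys = 1 - lower_tail(y, q, mu1, mu2, sigma)        # boys: weight q
    upper_girls = 1 - lower_tail(y, q * q, mu1, mu2, sigma)   # girls: weight q^2
    print(f"y={y}: P(boys > y)={upper_boys:.4f}, "
          f"P(girls > y)={upper_girls:.4f}, ratio={upper_boys / upper_girls:.2f}")
```

Because the high scoring component carries weight q for boys and only $q^2$ for girls, the boys' upper tail proportion exceeds the girls' at every cutoff, even when the difference in means is small.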


4.3 Model Y for Reading 


The reading model is analytically similar to the model for math. For math, the 
recessive gene is assumed to code for higher scores. For reading, the recessive gene 
with frequency q is assumed to be deleterious and thus codes for lower reading test 
scores. From a model perspective, the values of q and $1 - q$ for boys and $q^2$ and 
$1 - q^2$ for girls are interchanged wherever they appear: thus, $q^2$ is interchanged 
with $1 - q^2$ for girls, while $1 - q$ is interchanged with q for boys. Two inequalities 
change: $\mu_g > \mu_b$ and $\sigma_b^2 - \sigma_g^2 > \mu_g - \mu_b$. The distributions 
$f_b(y)$ and $f_g(y)$ for reading simply interchange their right-side coefficients of 
$f_k(y)$, k = 1, 2. As with Y for math, the observed inequalities ro and rdo are simply 
expectations under Y for reading. 
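
The interchange of coefficients can be illustrated by computing the reading moments directly. The sketch below uses the same moment formulas as for math, with the probability of the high scoring component set to $1 - q$ for boys and $1 - q^2$ for girls. The parameter values are arbitrary and the function name is hypothetical.

```python
def reading_moments(q, mu1, mu2, sigma2):
    """Return ((mean_boys, var_boys), (mean_girls, var_girls)) for reading under Y."""
    gap = mu2 - mu1
    high_b = 1 - q        # boys: probability of the high scoring component
    high_g = 1 - q * q    # girls: probability of the high scoring component
    boys = ((1 - high_b) * mu1 + high_b * mu2,
            high_b * (1 - high_b) * gap**2 + sigma2)
    girls = ((1 - high_g) * mu1 + high_g * mu2,
             high_g * (1 - high_g) * gap**2 + sigma2)
    return boys, girls


(mean_b, var_b), (mean_g, var_g) = reading_moments(0.3, 0.0, 2.0, 1.0)
print(f"boys : mean={mean_b:.4f} var={var_b:.4f}")
print(f"girls: mean={mean_g:.4f} var={var_g:.4f}")
print("girls' mean larger:", mean_g > mean_b)
print("boys' variance larger:", var_b > var_g)
print("variance gap exceeds mean gap:", var_b - var_g > mean_g - mean_b)
```

The variances are unchanged from the math case because the products $q(1 - q)$ and $q^2(1 - q^2)$ are symmetric under the interchange; only the ordering of the means reverses.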

Different genes are assumed to underpin performance in math than underpin 
performance in reading. q is used to denote the gene frequencies in either case 
because, in almost all cases, no confusion should result in whether the q refers to 
reading or math. Occasionally, q may refer to either or both tasks. When confusion 
seems possible and the distinction is important, $q_m$ and $q_r$, respectively, for math 
and reading q will distinguish the settings. Hats denote estimates for all parameters, 
e.g., $\hat{q}$ for the q estimate and $\hat{\sigma}$ for the σ estimate. 


4.4 Hellinger Distance H 


The Hellinger distance [76] will be used to measure the distance between the 
probability distributions of boys and girls in some illustrative examples in Chap. 5. 
H is a metric, that is, a distance, unlike δ, the model for d. δ assumes equal 
population variances while H does not. For continuous data, the square of the 
distance between the boys' and girls' probability distributions is 


$$
H^2 = 1 - \int \sqrt{f_b(y)\, f_g(y)}\; dy.
$$


H is zero if the two probability distributions are identical, and H is one if the two 
probability distributions do not share the same support; that is, their distributions 
over the horizontal axis are disjoint. For some continuous distributions, including 
the normal, there are algebraic expressions requiring only the parameters and which 
do not require integration [76]. 
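As an illustration of such an algebraic expression, the sketch below evaluates H for two normal distributions using the standard closed form for the integral of $\sqrt{f_b f_g}$. It assumes each sex's distribution is approximated by a single normal, which is a simplification and not the mixture model Y; the function names are hypothetical.

```python
import math

def hellinger_normal(mu_b, sd_b, mu_g, sd_g):
    """Hellinger distance H between N(mu_b, sd_b^2) and N(mu_g, sd_g^2)."""
    var_sum = sd_b ** 2 + sd_g ** 2
    # Closed form of the integral of sqrt(f_b * f_g) for two normal densities.
    overlap = math.sqrt(2.0 * sd_b * sd_g / var_sum) * math.exp(
        -((mu_b - mu_g) ** 2) / (4.0 * var_sum)
    )
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - overlap))

def hellinger_scaled(mu_b, sd_b, mu_g, sd_g):
    """The rescaled distance 100 * H, which lies between 0 and 100."""
    return 100.0 * hellinger_normal(mu_b, sd_b, mu_g, sd_g)

# Illustrative inputs only: means and standard deviations for boys and girls.
print(hellinger_scaled(101.0, 16.0, 100.0, 14.0))
```

No integration is needed; unequal standard deviations enter directly, which is the sense in which H does not assume equal population variances.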

Boys’ and girls’ test score distributions are usually quite similar and often are 
nearly identical. They share the same support, that is, boys and girls share the same 
test score outcomes. The result is that the estimated H values are often very small. 
Consequently, reported below are estimates of $H_d = 100H$. $H_d$ remains a distance, 
scaled between zero and one hundred. 
Chapter 5 
Model Estimation and Illustration of Test Score Distributions 


Parameter estimates under Y are provided for several reading and math test V and 
their estimated test score distributions from a variety of settings around the world, 
along with their graphic portrayals. 

While model estimates can be provided for nearly all V which satisfy mo or ro, 
the only samples to be considered below are those that appear to be representative 
of the larger population. V for SAT tests will not be considered, although, as noted 
in Chap. 2, they satisfy mdo well. That is because people who take the SAT choose 
to do so, and so they certainly do not reflect a more general population. 

All data explored below are either contained herein or are easily accessible 
online. As noted earlier, the estimation algorithm is given in Appendix A.3, while 
an R code implementation [110] appears in Appendix A.4. Those V chosen as 
examples below illustrate similarities and contrasts for different settings. 


5.1 Math Examples 


To assist understanding of Y, defined as Equation 4.1 and discussed in Chap. 4, two 
toy examples follow. Both masquerade as math test score settings, and both illustrate 
the ease of generating data under Y and the corresponding estimation of parameters. 
Implementation using physical coins and poker chips can be tedious. 


5.1.1 Example 1: Coin Tosses 


On one side of each of two pennies, mark 0, and on the other side, mark 10. On 
one side of a nickel, mark 5, and on the other side, mark -5. To simulate a virtual 
boy's test score, flip one penny and the nickel. The penny plays the role of the 
random variable $B_b$, with outcomes 0 and 10, each with probability $q = 1/2$. The 
nickel plays the role of the random variable $N_b$ with outcomes -5 and 5, each with 
probability one-half. Record the sum of the observed outcomes of both coins. This 
sum is a realized value $y$ of $Y_b$ from the model $Y_b = B_b + N_b$. Now repeat 100 
times, thus obtaining “test scores” for 100 virtual boys. 

© The Author(s) 2024 
H. Thomas, Sex Differences in Reading and Math Test Scores of Children, 
Monographs in the Psychology of Education, 
https://doi.org/10.1007/978-3-031-41272-1_5 

Perhaps the following may further aid intuition: suppose the penny tossed for 
boys lands 0. Then following the nickel’s toss, the outcome of the summed values 
is either —5 or 5. Should the penny have landed 10, the corresponding summed 
outcome following the nickel's toss would be 5 or 15. Thus, the two penny 
outcomes, 0 or 10, are each the centers of two distributions each with variance 
25, the nickel's variance, resulting in a two-component mixture distribution with 
component means 0 and 10 and component variance 25. Each component's weight 
coefficient is one-half, the probability of each penny outcome. This paragraph completely 
specifies a discrete two-component mixture distribution. 

For girls, flip all three coins. If both pennies show 10, take the value shown on 
the nickel, add 10, and record. If the outcome for the pennies is otherwise, simply 
record the outcome shown on the nickel. The two pennies play the role of $B_g$ with 
outcome 10 with probability $q^2 = 1/4$ and zero with probability $1 - q^2 = 3/4$. The 
nickel plays for girls the role of $N_g$. The result is a realized value $y$ of $Y_g = B_g + N_g$. 
Repeat the procedure 100 times, recording the data for 100 virtual girls. The possible 
outcomes for both sexes are -5, 5, and 15. One V realization is 


$V = \{\bar{x}_b = 5,\ \bar{x}_g = 2.400,\ s_b = 7.247,\ s_g = 6.454,\ n_b = 100,\ n_g = 100\}$, 


which satisfies mdo. Below are the estimates, with hats, and the parameter values 
in parentheses: 


$\hat{q} = 0.396\ (q = 0.5)$, 
$\hat{q}^2 = 0.157\ (q^2 = 0.25)$, 
$\hat{\mu}_1 = 0.694\ (\mu_1 = 0)$, 
$\hat{\mu}_2 = 11.562\ (\mu_2 = 10)$, and 
$\hat{\sigma}^2 = 25.142\ (\sigma^2 = 25)$. 
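The coin tossing can also be done in software. The following is a minimal sketch that simulates Example 1 and assembles one realization of V. The function names and the seed are hypothetical, and the sample standard deviation uses the n - 1 divisor, which is an assumption about how the s values were computed.

```python
import random
import statistics

def simulate_scores(n, p_high, rng, low=0.0, high=10.0, noise=(-5.0, 5.0)):
    """Draw n scores Y = B + N.

    B equals `high` with probability p_high and `low` otherwise;
    N is drawn uniformly from the two `noise` values (the nickel).
    """
    scores = []
    for _ in range(n):
        b = high if rng.random() < p_high else low
        scores.append(b + rng.choice(noise))
    return scores

def summarize(boys, girls):
    """Return (mean_b, mean_g, sd_b, sd_g, n_b, n_g), the elements of V."""
    return (statistics.mean(boys), statistics.mean(girls),
            statistics.stdev(boys), statistics.stdev(girls),
            len(boys), len(girls))

rng = random.Random(2024)                        # arbitrary seed, for repeatability
q = 0.5
boys = simulate_scores(100, q, rng)              # one penny: weight q
girls = simulate_scores(100, q ** 2, rng)        # both pennies show 10: weight q^2
print(summarize(boys, girls))
```

Calling `simulate_scores` with `low=2.0`, `high=4.0`, `noise=(-2.0, 2.0)` and weights 0.4 and 0.16 reproduces the setup of the urn example in the same way. A simulated V need not satisfy mdo on every run.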


5.1.2 Example 2: Poker Chips in Three Urns 


Three urns are labeled $B_b$ and $B_g$, the boys’ urn and girls’ urn, respectively, and 
$N$. $B_b$ contains 40 chips labeled 4 and 60 labeled 2; $B_g$ contains 16 chips labeled 
4 and 84 chips labeled 2. $N$ contains 50 chips labeled -2 and 50 labeled 2. For 
boys, sample with replacement one chip from urn $B_b$ and one from urn $N$ (which 
may be thought of as a value of $N_b$), add the numbers, save the result, and repeat 
100 times. Repeat the process for girls except that urn $B_g$ is used. Thus, there are 100 
virtual “observations” for each sex. The possible outcomes are 0, 2, 4, and 6. A V 
realization is 

$V = \{\bar{x}_b = 2.780,\ \bar{x}_g = 1.980,\ s_b = 2.272,\ s_g = 1.980,\ n_b = 100,\ n_g = 100\}$. 


The elements of V satisfy mdo. The parameter estimates have hats with parameter 
values in parentheses: 


$\hat{q} = 0.355\ (q = 0.400)$, 
$\hat{q}^2 = 0.126\ (q^2 = 0.160)$, 
$\hat{\mu}_1 = 1.540\ (\mu_1 = 2)$, 
$\hat{\mu}_2 = 5.033\ (\mu_2 = 4)$, and 
$\hat{\sigma}^2 = 2.471\ (\sigma^2 = 4)$. 


Y for each sex is a mixture model with two latent component distributions $f_1(y)$ 
and $f_2(y)$. Two decisions are usually the most important in constructing 
mixture models. The first is the specification of the number of components; here 
there are always two for each sex. The second is the specification of 
the latent unobserved component distributions. In most mixture models, the latent 
components are assumed to be normal distributions. However, typically there is no 
theory or empirical basis for justifying their specification. A wrong specification 
of the component distributions can lead to misunderstandings and inappropriate 
conclusions. There is usually no easy way to assess the consequences of wrongly 
specifying the mixture components, and thus criticisms or concerns that the 
components might be wrongly specified are difficult to counter. This problem is 
perhaps the central concern of many critics of the latent variables mixture model 
approach to understanding individual differences. 

This concern is not relevant here because under Y, as noted earlier, the component 
distributions $f_1(y)$ and $f_2(y)$ do not require specification for their parameters 
to be estimated. They can be any probability distribution, discrete or continuous. 
Consequently, Y consists of a pair of weak semiparametric mixture models. 
However, to construct graphs of the corresponding solutions provided below, the 
components must be specified. Thus post estimation, when component distributions 
graphically appear, they are constructed assuming the component distributions 
$f_1(y)$ and $f_2(y)$ are normal distributions. 

An additional unusual advantage of the current estimation procedure, very unlike 
most solutions of finite mixtures, is that the solutions are closed form. That is, the 
solution output will always be the same with the same V input. In conventional 
mixture solutions, unlike the special situation here, and where the components 
are assumed to be normal distributions, iterative procedures often with randomly 
specified starting conditions are required. Consequently, the same mixture solution, 
given identical inputs with each solution, cannot be guaranteed. 
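To make the idea of a closed-form, distribution-free solution concrete, here is a minimal method-of-moments sketch. It is not the algorithm of Appendix A.3 and will generally give somewhat different numbers from those reported in this chapter; it simply equates the two sample means and two sample variances to their model expressions, $E(Y_b) = (1-q)\mu_1 + q\mu_2$, $E(Y_g) = (1-q^2)\mu_1 + q^2\mu_2$, and the matching variances, and solves for the four parameters. The function name is hypothetical.

```python
import math

def moment_estimates(xb, xg, sb, sg):
    """Solve the four moment equations for (q, mu1, mu2, sigma2)."""
    mean_diff = xb - xg             # model value: d * q * (1 - q), with d = mu2 - mu1
    var_diff = sb ** 2 - sg ** 2    # model value: d^2 * q * (1 - q) * (1 - q - q^2)
    r = var_diff / mean_diff ** 2   # hence r = (1 - q - q^2) / (q * (1 - q))
    # r gives the quadratic (r - 1) q^2 - (r + 1) q + 1 = 0;
    # the expression below is its single root in the interval (0, 1).
    q = 2.0 / ((r + 1.0) + math.sqrt((r - 1.0) ** 2 + 4.0))
    d = mean_diff / (q * (1.0 - q))
    mu1 = xb - q * d
    mu2 = mu1 + d
    sigma2 = sb ** 2 - d ** 2 * q * (1.0 - q)   # can be negative for inconsistent inputs
    return q, mu1, mu2, sigma2

# Summary statistics of the Italian Grade 5 data used in Example 3.
print(moment_estimates(198.913, 191.881, 43.058, 39.845))
```

No component distribution is specified and no starting values or iterations are involved, so the same V always returns the same estimates. The sketch requires unequal means and makes no use of the sample sizes.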
5.1.3 Example 3: Italian Math Test 


Considerably more detail will accompany this example, which may serve as a guide 
for interpreting the remaining examples for both math and reading. Cascella in 2020 
[60] reported the large sample testing of Italian children, Grades 5 and 10, using the 
Italian INVALSI math achievement test. For both grades, mdo is satisfied. Grade 5 
children, aged 10 years, are considered here. 

The starting point is the summary statistics data vector V, the only information used. 

$V = \{\bar{x}_b = 198.913,\ \bar{x}_g = 191.881,\ s_b = 43.058,\ s_g = 39.845,\ n_b = 15453,\ n_g = 15415\}$. 


$\hat{V} = \{198.913,\ 191.985,\ 43.065,\ 39.837\}$. 


$\hat{V}$ is given here for easy comparison with V and will be explained momentarily 
below. Below are the parameter estimates with standard errors in parentheses: 


$\hat{q} = 0.178\ (0.023)$, 
$\hat{q}^2 = 0.032\ (0.010)$, 
$\hat{\mu}_1 = 190.362\ (0.572)$, 
$\hat{\mu}_2 = 238.492\ (4.34)$, and 
$\hat{\sigma}^2 = 1516.197\ (21.59)$. 


Define X-linked heritability for boys as 

$$h_b^2 = \mathrm{Var}(B_b)/\mathrm{Var}(Y_b)$$

and similarly for girls. For boys $\hat{h}_b^2 = 0.182\ (0.015)$, and for girls $\hat{h}_g^2 = 0.045\ (0.007)$. 

Once the parameter estimates are obtained, the predicted values are reported in 
$\hat{V}$. These values may be compared with the observed values in V. The differences 
between V and $\hat{V}$ may be taken as an index of how well the model accounts for 
the data. The elements of $\hat{V}$ are the estimated values of the sample means and 
sample standard deviations, given the model estimates, and are computed from 
the model expressions with the estimates replacing the parameter values. 
Thus, these are approximately the expected values given the model estimates. The 
expressions to do so are given above in Chap. 4. 

For example, the boys’ predicted value of $\bar{x}_b$, which is 198.913 in this example, is 
the estimated expected value 

$$\hat{E}(Y_b) = (1 - \hat{q})\hat{\mu}_1 + \hat{q}\hat{\mu}_2.$$

The girls’ predicted standard deviation $s_g$ is $\sqrt{\widehat{\mathrm{var}}(Y_g)}$ with 

$$\widehat{\mathrm{var}}(Y_g) = (\hat{\mu}_2 - \hat{\mu}_1)^2\, \hat{q}^2 (1 - \hat{q}^2) + \hat{\sigma}^2.$$

In this example, it is 39.837. As the above comparison of V and $\hat{V}$ reveals, the 
model estimates are in excellent agreement with the data. 

Fig. 5.1 Example 3. Thick lines are $\hat{f}_b(y)$ for boys and $\hat{f}_g(y)$ for girls, and thin lines are their two 
components. The ordinates are at $\hat{\mu}_1$ and $\hat{\mu}_2$ 
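The predicted summary statistics and the X-linked heritabilities can be sketched as follows. The code plugs the printed, rounded estimates of this example into the model expressions for the means and variances, so small discrepancies from the reported predicted values are to be expected; the function name and dictionary keys are hypothetical.

```python
import math

def predicted_summary(q, mu1, mu2, sigma2):
    """Predicted means, standard deviations and X-linked heritabilities."""
    d2 = (mu2 - mu1) ** 2
    var_bb = d2 * q * (1.0 - q)              # Var(B_b): higher-component weight q
    var_bg = d2 * q ** 2 * (1.0 - q ** 2)    # Var(B_g): higher-component weight q^2
    return {
        "mean_b": (1.0 - q) * mu1 + q * mu2,
        "mean_g": (1.0 - q ** 2) * mu1 + q ** 2 * mu2,
        "sd_b": math.sqrt(var_bb + sigma2),
        "sd_g": math.sqrt(var_bg + sigma2),
        "h2_b": var_bb / (var_bb + sigma2),
        "h2_g": var_bg / (var_bg + sigma2),
    }

# Rounded estimates reported for the Italian Grade 5 example.
for name, value in predicted_summary(0.178, 190.362, 238.492, 1516.197).items():
    print(name, round(value, 3))
```

Both means and both variances depend on the same four parameters, which is why a single set of estimates constrains all four observed summary statistics at once.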

Figure 5.1 provides a graphical solution with the post estimation assumption that 
$f_1(y)$ and $f_2(y)$, the latent components, are normally distributed. The estimated 
means of these two latent component distributions are shown by the vertical 
ordinates at the locations of $\hat{\mu}_1$ and $\hat{\mu}_2$, the estimated component means. The 
component distributions for both boys and girls share the same standard deviation, 
$\hat{\sigma} = 38.94$. The two bold lines, dashed and solid, are the estimates of $f_b(y)$ and 
$f_g(y)$, namely the estimated population models for boys and girls, respectively. 
If the test scores of the 30,868 children were available, these bold lines should 
resemble the histograms for boys and girls separately if the latent normal component 
distributions are reasonable specifications. The bold lines are determined by the 
thin lines of the unobserved components. There is a pair of thin lines for each sex. 
The two much smaller, thinner lines in the upper tails represent the distributions 
for boys (dashed lines) and girls (solid lines) for higher scoring latent component 
distributions. The shapes of these component distributions for boys and girls are 
identical; the girls’ component is smaller because the weight $\hat{q}^2$ associated with this 
component is smaller for girls than the $\hat{q}$ weight for boys. 

In particular, these two upper tail components with their estimated coefficient 
weights are $\hat{q}^2 \hat{f}_2(y) = 0.032\, \hat{f}_2(y)$ for girls, and for boys it is $\hat{q} \hat{f}_2(y) = 0.178\, \hat{f}_2(y)$. 
$\hat{f}_1(y)$ and $\hat{f}_2(y)$ denote the probability distributions with estimates replacing their 
parameters. The other pair of thin lines, in the lower tails, are those associated with 
the boys’ and girls’ latent lower score component, $\hat{f}_1(y)$. The girls’ lower scoring 
component thin solid line is graphically indistinguishable in some regions of the 
graph. Precisely, these are $(1 - \hat{q}^2) \hat{f}_1(y) = 0.968\, \hat{f}_1(y)$ for girls and $(1 - \hat{q}) \hat{f}_1(y) = 
0.822\, \hat{f}_1(y)$ for boys. The girls’ lower scoring weighted component line tracks the 
girls’ $\hat{f}_g(y)$ closely because the lower component weight 0.968 is nearly one. The 
boys’ lower scoring weighted component tracks $\hat{f}_b(y)$ less closely in most of the 
regions of the graph because the lower score component weight for the boys departs 
more sharply from one, namely 0.822. 
The upper tail of $\hat{f}_b(y)$ for boys clearly shows more probability mass than 
the corresponding upper tail region of $\hat{f}_g(y)$ for the girls. It is commonly 
observed, e.g., [11], that the upper tail region of the boys’ distribution “is fatter” than that of the 
distribution for girls. This figure, Fig. 5.1, illustrates that Y can explain this fact. An 
analytically based explanation will be provided later in Chap. 6. 

In Fig. 5.1, the lines associated with the lower scoring weighted latent probability 
distributions, for both boys and girls, extend well above 250. Neither the model Y 
nor the data imply, as mistakenly might be thought, that higher scoring boys or girls 
only come from the higher scoring component. While probabilities depend on the 
test score taken as the reference point (the lower limit of an integral), in fact, in this 
example, more higher scoring boys and girls can come from the lower than from 
the higher scoring components. To make this important point precise and using the 
equations defined in Chap. 4, the total estimated proportion of girls above $\hat{\mu}_2 \approx 238$ 
is $0.123 = 1 - \hat{F}_g(238)$, while $(1 - \hat{q}^2) \int_{238}^{\infty} \hat{f}_1(y)\,dy = 0.107$ is the area under 
the girls’ weighted lower scoring component but above 238, and $\hat{q}^2 \int_{238}^{\infty} \hat{f}_2(y)\,dy = 
0.016$ is the area under the girls’ weighted higher scoring component and above 238. 
Thus, by far the largest proportion of higher scoring girls, those with scores above 
238, comes from the lower scoring, not the higher scoring component, and in this 
example, similarly for boys as well. 

Keep in mind these probabilities are estimated under the assumption the latent 
components have normal distributions. The reality may be somewhat different. To 
repeat, this example illustrates the important fact that higher scoring children come 
from both mixture components. 
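The areas quoted above can be checked with a short sketch, assuming normal latent components and using the printed, rounded estimates. The function names are hypothetical, and the normal upper tail probability is computed from the complementary error function.

```python
import math

def normal_upper_tail(x, mean, sd):
    """P(Y > x) for a normal distribution with the given mean and sd."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

def upper_tail_shares(cut, weight_high, mu1, mu2, sigma2):
    """Weighted areas above `cut` under the lower and higher components."""
    sd = math.sqrt(sigma2)
    lower = (1.0 - weight_high) * normal_upper_tail(cut, mu1, sd)
    upper = weight_high * normal_upper_tail(cut, mu2, sd)
    return lower, upper, lower + upper

q = 0.178
# Girls: the higher component carries weight q^2; boys: weight q.
girls = upper_tail_shares(238.0, q ** 2, 190.362, 238.492, 1516.197)
boys = upper_tail_shares(238.0, q, 190.362, 238.492, 1516.197)
print([round(v, 3) for v in girls])
print([round(v, 3) for v in boys])
```

The first two entries of each result are the contributions of the lower and higher scoring components, and the third is the total proportion above the cut score.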

One can estimate the number of girls and boys in these upper components by 
computing $\hat{q}^2 n_g \approx 486$ girls and $\hat{q}\, n_b \approx 2745$ boys. The X-linked heritability 
estimate for girls is very small, 0.045, a tiny proportion of total variance for girls: 
$\mathrm{Var}(B_g) + \mathrm{Var}(N_g) = 1587.010$ with $\mathrm{Var}(B_g) = 70.812$ and $\mathrm{Var}(N_g) = 1516.197$. 
For boys, the X-linked heritability is much larger, 0.182, but still relatively small. 
Thus, the proportion of X-linked variance accounted for in Y is a small fraction of 
the total variance for both sexes, a finding entirely expected. 

The probability that a given girl's test score $y$ “comes from” the higher scoring 
second component can also be estimated: 

$$\hat{P}(\text{component 2} \mid y) = \frac{\hat{q}^2 \hat{f}_2(y)}{\hat{f}_g(y)}.$$

However, to do so requires $f_1(y)$ and $f_2(y)$ to be specified, perhaps as normal 
distributions. 
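Under the normal specification, this probability follows from Bayes' rule: the weighted higher component density divided by the full mixture density. A minimal sketch with hypothetical function names, using the rounded estimates of this example:

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def prob_component2(y, weight_high, mu1, mu2, sigma2):
    """Probability that score y arose from the higher scoring component."""
    sd = math.sqrt(sigma2)
    high = weight_high * normal_pdf(y, mu2, sd)
    low = (1.0 - weight_high) * normal_pdf(y, mu1, sd)
    return high / (high + low)     # the denominator is the mixture density at y

q = 0.178
# For girls the higher component weight is q^2; for boys it would be q.
for score in (150.0, 238.492, 300.0):
    print(score, prob_component2(score, q ** 2, 190.362, 238.492, 1516.197))
```

The probability rises with the score but depends entirely on the assumed component shapes, so it should be read as illustrative.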


5.1.3.1 Standard Errors 


Standard errors associated with the parameter estimates are reported above in this 
example because the elements of V seem to have resulted from random sampling, 
at least approximately. These standard errors were obtained using the 
parametric bootstrap [27, p. 53], assuming that, post estimation, the component 
distributions $f_1(y)$ and $f_2(y)$ are normal. Whenever standard errors 
appear, they were obtained by the parametric bootstrap, the components are assumed 
to be normal, and the elements of V are assumed to be obtained by random 
sampling, at least approximately. 
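The parametric bootstrap idea can be sketched as follows: simulate new samples from the fitted normal mixtures, re-estimate the parameters from each simulated V, and take the standard deviation of the re-estimates. The estimator used here is a simple method-of-moments stand-in, an assumption made only to keep the sketch self-contained; it is not the Appendix A.3 algorithm, so the resulting standard errors need not match those reported. All function names are hypothetical.

```python
import math
import random
import statistics

def moment_estimates(xb, xg, sb, sg):
    """Stand-in estimator: solve the four moment equations in closed form."""
    mean_diff = xb - xg
    r = (sb ** 2 - sg ** 2) / mean_diff ** 2
    q = 2.0 / ((r + 1.0) + math.sqrt((r - 1.0) ** 2 + 4.0))
    d = mean_diff / (q * (1.0 - q))
    mu1 = xb - q * d
    return q, mu1, mu1 + d, sb ** 2 - d ** 2 * q * (1.0 - q)

def draw_scores(n, weight_high, mu1, mu2, sd, rng):
    """Simulate n scores from a two-component normal mixture."""
    return [rng.gauss(mu2 if rng.random() < weight_high else mu1, sd)
            for _ in range(n)]

def bootstrap_se(estimates, n_b, n_g, reps, seed=0):
    """Parametric bootstrap standard errors for (q, mu1, mu2, sigma2)."""
    q, mu1, mu2, sigma2 = estimates
    sd = math.sqrt(sigma2)
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        boys = draw_scores(n_b, q, mu1, mu2, sd, rng)
        girls = draw_scores(n_g, q * q, mu1, mu2, sd, rng)
        draws.append(moment_estimates(
            statistics.mean(boys), statistics.mean(girls),
            statistics.stdev(boys), statistics.stdev(girls)))
    # One standard deviation per parameter, taken across the bootstrap replicates.
    return tuple(statistics.stdev(column) for column in zip(*draws))

# Illustrative parameter values and sample sizes, not taken from any data set.
print(bootstrap_se((0.3, 100.0, 120.0, 225.0), 2000, 2000, reps=50))
```

In practice many more replicates would be used, and the seed fixes the simulated samples so that repeated runs agree.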

Standard errors for parameter estimates obtained from large sample survey V, 
for NAEP or PISA, are not given. For one, there are no sample sizes associated with 
the estimates in V, and for another, estimates in V are far removed from random 
sample estimates. These facts were noted in Chap. 2. There is simply no guidance 
on how one might proceed to construct a standard error for, say, d estimated from 
NAEP data, and thus no standard errors appear. 


5.1.4 Example 4: CogAT or Cognitive Abilities Test 


The CogAT is a multiple choice test, with varying types of number puzzles, 
analogies, and series. This example employs an earlier version of the CogAT from 
1987 [111] and involved eighth grade children. The observed V, which satisfies mdo, 
and predicted $\hat{V}$ are 

$V = \{101,\ 100,\ 16,\ 14,\ 5085,\ 5148\}$, 
$\hat{V} = \{101,\ 100,\ 16.091,\ 13.896\}$. 


Below are parameter estimates with standard errors in parentheses: 


$\hat{q} = 0.015\ (0.008)$, 
$\hat{q}^2 = 0.0002\ (0.0003)$, 
$\hat{\mu}_1 = 99.98\ (0.19)$, 
$\hat{\mu}_2 = 166.84\ (38)$, 
$\hat{\sigma}^2 = 192.08\ (5.6)$, 
$\hat{h}_b^2 = 0.258\ (0.037)$, and 
$\hat{h}_g^2 = 0.005\ (0.003)$. 


The X-linked heritability for girls is vanishingly small, but for boys much larger. 

Figure 5.2 graphically illustrates this solution, which is very unlike Fig. 5.1. 
It reveals no clear visual separation between $\hat{f}_b(y)$ and $\hat{f}_g(y)$ nor among the 
latent component distributions either. That is because $\hat{q} = 0.015$ is very small, 
so the higher component probability weights for both sexes are near zero, driving 
these higher scoring components toward zero for all $y$. Specifically, for girls 
$\hat{q}^2 \hat{f}_2(y) = 0.0002\, \hat{f}_2(y)$, and for boys $\hat{q} \hat{f}_2(y) = 0.015\, \hat{f}_2(y)$. Thus, the weighted 
first component distributions for both sexes and $\hat{f}_b(y)$ and $\hat{f}_g(y)$ all 
essentially coincide graphically. Some evidence of the boys’ second component distribution is 
just noticeable in the right tail of Fig. 5.2. 
Fig. 5.2 Example 4. Bold lines are $\hat{f}_b(y)$ for boys and $\hat{f}_g(y)$ for girls. The second higher component 
distribution for boys is just graphically distinguishable. The lower scoring component lines merge 
with the $\hat{f}_b(y)$ and $\hat{f}_g(y)$ distributions and are not graphically distinguishable 


Compare the bootstrap standard error associated with $\hat{\mu}_1$ of 0.19 with the 
standard error associated with $\hat{\mu}_2$ of 38. The uncertainty regarding the location of 
the second component mean $\mu_2$ is large. That is because few observations are 
expected to be in the extreme right tails of the distributions on which to base the 
location of $\mu_2$. One 
only expects about $1 \approx 0.0002 \times 5{,}148$ girl and about $77 \approx 0.015 \times 5{,}085$ boys in 
the right higher scoring components. 

Even though a glance at Fig. 5.2 suggests there are no sex differences in math, the 
inequalities in V require an explanation. While $\bar{x}_b$ and $\bar{x}_g$ differ by only one, they 
are more than three standard errors apart. $\hat{V}$ is close to V, so Y provides a coherent 
explanation, although the sex differences are not evident graphically. 
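The "more than three standard errors" statement can be checked directly from V. The sketch assumes the usual standard error of a difference between two independent sample means, which is an assumption about the calculation intended here; the function name is hypothetical.

```python
import math

def mean_difference_in_se_units(xb, xg, sb, sg, nb, ng):
    """Return (difference / standard error, standard error) for two means."""
    se = math.sqrt(sb ** 2 / nb + sg ** 2 / ng)
    return (xb - xg) / se, se

# Elements of V for the CogAT example.
ratio, se = mean_difference_in_se_units(101, 100, 16, 14, 5085, 5148)
print(round(se, 3), round(ratio, 2))
```

With more than 5000 children of each sex, the standard error of the mean difference is well under one score point, so a one point difference is not negligible.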


5.1.5 Example 5: Stone's 1908 Math Tests 


As noted at the outset, Stone [3] constructed two tests: a fundamental test the results 
of which were displayed in Fig. 1.1 and the other one an arithmetical reasoning 
test. Both tests were intended for use with sixth grade children. There were 14 
numerical fundamental items, which assessed addition, subtraction, multiplication, 
and division. The first was the addition of the following numbers (arranged 
in a column): 2375, 4052, 6345, 260, 5041, and 1543. Problem 14 was: Multiply 96879 
by 896. There were 12 reasoning problems. The first is “If you buy 2 tablets at 7 
cents each, and a book for 65 cents, how much change should you receive from 
a two-dollar bill?" The items of both tests were presented in an increasing order 
of difficulty, with the sequence determined at least in part, by earlier pretesting. 
Children were given 12 minutes for the fundamentals and 15 minutes for the 
reasoning test. The number and ages of children tested were not specified, but 
comments suggest about 3000 sixth graders received both tests in 26 different school 
systems. 

The scoring for both tests employed item weighting which was based on 
Thorndike's belief that "arithmetic is but an abstract name for a number of partially 
independent abilities [3, p. 20].” Weights were designed to reflect the difficulty of 
the item. The precise way by which weighting was determined seems unclear. In 
the end, scores for the 12 reasoning items ranged from 0 to 15.2, while for the 14 
fundamental items, scores ranged from 0.3 to 6.3. 




Stone provided data from 250 boys and 250 girls, “...500 pupils chosen at 
random from four representative public school systems [3, p. 30]" in his Table X 
and Table XI. These data, aggregated over the four school systems from his two 
tables, are given in Table 5.1 for the fundamentals test and in Table 5.2 for the 
reasoning test. The data from Table 5.1 provided the basis for Fig. 1.1. 
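The summary vector V for the fundamentals test can be recomputed from the frequencies in Table 5.1. The sketch below does so with a hypothetical helper; the n - 1 divisor for the standard deviation is an assumption about how the s values were obtained.

```python
import math

# (score, number of boys, number of girls), transcribed from Table 5.1.
TABLE_5_1 = [
    (0.3, 1, 0), (0.9, 0, 2), (1.1, 4, 2), (1.2, 1, 2), (1.3, 2, 2),
    (1.4, 2, 4), (1.5, 4, 3), (1.6, 2, 2), (1.7, 3, 3), (1.8, 2, 3),
    (1.9, 10, 10), (2.0, 7, 10), (2.1, 6, 11), (2.2, 3, 6), (2.3, 6, 9),
    (2.4, 5, 6), (2.5, 4, 3), (2.6, 7, 7), (2.7, 9, 11), (2.8, 7, 10),
    (2.9, 15, 15), (3.0, 28, 24), (3.1, 9, 11), (3.2, 9, 6), (3.3, 13, 12),
    (3.4, 9, 8), (3.5, 3, 7), (3.6, 3, 2), (3.7, 3, 1), (3.8, 3, 4),
    (3.9, 6, 8), (4.0, 9, 6), (4.1, 9, 4), (4.2, 4, 2), (4.3, 3, 1),
    (4.4, 3, 3), (4.5, 4, 2), (4.6, 4, 4), (4.7, 6, 5), (4.8, 4, 5),
    (4.9, 3, 4), (5.0, 5, 5), (5.1, 1, 2), (5.3, 3, 1), (5.4, 2, 1),
    (5.5, 2, 0), (5.6, 1, 0), (5.8, 0, 1), (6.3, 1, 0),
]

def frequency_summary(rows, column):
    """Mean, standard deviation and count from a score/frequency table.

    `column` is 1 for boys and 2 for girls.
    """
    n = sum(row[column] for row in rows)
    mean = sum(row[0] * row[column] for row in rows) / n
    squares = sum(row[column] * (row[0] - mean) ** 2 for row in rows)
    return mean, math.sqrt(squares / (n - 1)), n

mean_b, sd_b, n_b = frequency_summary(TABLE_5_1, 1)
mean_g, sd_g, n_g = frequency_summary(TABLE_5_1, 2)
print(round(mean_b, 3), round(mean_g, 3), round(sd_b, 3), round(sd_g, 3), n_b, n_g)
```

The same helper applies to any of the tests for which exact score frequencies are tabled.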

Stone's effort was impressive for the time. He had to score, weight items 
of individual tests, apparently all by hand, with medians replacing means, and 
Thorndike's AD or average deviation replacing the standard deviation. He credits 


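Table 5.1 Stone's (1908) Fundamental Data 

Score | # Boys | # Girls | Score | # Boys | # Girls 
0.3 | 1 | 0 | 3.4 | 9 | 8 
0.9 | 0 | 2 | 3.5 | 3 | 7 
1.1 | 4 | 2 | 3.6 | 3 | 2 
1.2 | 1 | 2 | 3.7 | 3 | 1 
1.3 | 2 | 2 | 3.8 | 3 | 4 
1.4 | 2 | 4 | 3.9 | 6 | 8 
1.5 | 4 | 3 | 4.0 | 9 | 6 
1.6 | 2 | 2 | 4.1 | 9 | 4 
1.7 | 3 | 3 | 4.2 | 4 | 2 
1.8 | 2 | 3 | 4.3 | 3 | 1 
1.9 | 10 | 10 | 4.4 | 3 | 3 
2.0 | 7 | 10 | 4.5 | 4 | 2 
2.1 | 6 | 11 | 4.6 | 4 | 4 
2.2 | 3 | 6 | 4.7 | 6 | 5 
2.3 | 6 | 9 | 4.8 | 4 | 5 
2.4 | 5 | 6 | 4.9 | 3 | 4 
2.5 | 4 | 3 | 5.0 | 5 | 5 
2.6 | 7 | 7 | 5.1 | 1 | 2 
2.7 | 9 | 11 | 5.3 | 3 | 1 
2.8 | 7 | 10 | 5.4 | 2 | 1 
2.9 | 15 | 15 | 5.5 | 2 | 0 
3.0 | 28 | 24 | 5.6 | 1 | 0 
3.1 | 9 | 11 | 5.8 | 0 | 1 
3.2 | 9 | 6 | 6.3 | 1 | 0 
3.3 | 13 | 12 | | | 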
Table 5.2 Stone's (1908) Reasoning Data 

Score | # Boys | # Girls | Score | # Boys | # Girls 
0 4 5 6.0 3 6 
10 |1 8 61 | 2 2 
13 1 0 6.2 | 11 6 
18 l 0 64 |29 14 
20 |6 7 66 |8 1 
22 | 0 3 68 | 0 2 
23 |1 1 69 [1 0 
24 | 0 1 70 | 6 3 
25 |o 1 71 |2 0 
26 | 0 1 EHE? 2 
2.8 1 0 7.4 1 1 
3.0 12 15 7.5 0 10 
3.2 1 2 7.6 | 20 10 
3:3 1 1 7.7 (0) 1 
3.4 1 2 7.8 10 2 
3.6 0 4 8.0 7 4 
37 |0 1 &1 | 0 І 
4.0 12 23 8.2 4 4 
4.2 1 2 8.4 0 2 
43 |0 1 8.6 | 1 2 
44 | 2 2 92 |19 9 
4.5 1 0 9б |2 1 
4.6 2 5 9.8 5 0 
47 |0 2 100 | 0 1 
48 [3 2 102 |5 3 
49 |I 1 mi [1 1 
5.0 | 20 19 i12 |8 3 
51 |0 1 116 | 1 0 
52 |7 11 122 n 0 
53 | 0 1 127° a 0 
54 | 17 13.2 | 1 1 
5.6 5 6 14.3 1 0 
5.8 3 2 15.2 1 0 
5.9 1 1 


his wife for doing much of the statistical analysis. Much of the thrust of his work 
was to compare test performances of different school systems and their relative 
performance on the two tests. For example, he compares the ratio of boys to girls 
in their variabilities, using ratios of AD/median which he called a coefficient of 
variability. Ratios of these quantities defined his variability index. These ratios are 
reported for smaller numbers of boys and girls for several different school systems. 
His fundamental test results satisfy mo: 

$V = \{3.193,\ 3.026,\ 1.048,\ 0.996,\ 250,\ 250\}$ 

with 

$\hat{V} = \{3.193,\ 3.026,\ 1.059,\ 0.984\}$ 

close to V. The estimates are 


$\hat{q} = 0.175$, 
$\hat{q}^2 = 0.031$, 
$\hat{\mu}_1 = 2.991$, 
$\hat{\mu}_2 = 4.147$, 
$\hat{\sigma}^2 = 0.929$, 
$\hat{h}_b^2 = 0.172$, and 
$\hat{h}_g^2 = 0.041$. 


Figure 5.3 displays the model estimates for Stone's fundamentals test. The higher 
scoring latent components for both boys and girls are visible in the figure, which 
resembles Fig. 5.1. 

Stone's reasoning test results satisfy mdo: 


$V = \{6.486,\ 5.440,\ 2.532,\ 2.316,\ 250,\ 250\}$, 


while 


$\hat{V} = \{6.486,\ 5.440,\ 2.537,\ 2.312\}$. 


The estimates are 


$\hat{q} = 0.500$, 
$\hat{q}^2 = 0.250$, 
$\hat{\mu}_1 = 4.393$, 
$\hat{\mu}_2 = 8.579$, 
$\hat{\sigma}^2 = 2.057$, 
$\hat{h}_b^2 = 0.680$, and 
$\hat{h}_g^2 = 0.615$. 

Fig. 5.3 Stone’s Fundamentals Test, Example 5, which may be compared with Fig. 1.1 


The reasoning test estimate of q is far larger than the fundamental test estimate or 
estimates from the previous two examples. The estimates reflect, of course, weights 
Stone assigned to each child's test score answer. 

Stone stated his primary goal was “...to extract knowledge of the relation 
between distinctive educational procedures and the resulting products [3, p. 7].” 
Since his primary interest was in educational procedures, why would he have 
chosen to provide, as the only raw data tabled, data on boys and girls separately, 
when numerous other tabulations more central to his stated goal were possible? A 
speculation is that he observed sex differences in test performance among the 3000 
boys and girls he observed in the 26 school systems, in the six chiefly north eastern 
states he visited; perhaps it was this information he wished to preserve, even though 
he could not do the analysis he might have wished to do. 


5.1.6 Example 6: The Courtis Arithmetic Tests 


In 1909, the Russell Sage Foundation of New York City issued a damning report 
concerning the city's school system “... which indicated that retardation was costing 
the city millions of dollars annually [112, p. 53].” By any measure, the consequence 
was extraordinary. A huge investigation was mounted which considered numerous 
aspects of the educational operation, covering janitorial services, school layout, 
salaries, teachers' ethnic backgrounds, and more. The concern here is with the only 
assessment of school achievement, the eight arithmetic tests developed by Stuart A. 
Courtis. At the time, he was in charge of arithmetic at the Detroit Home and Day 
School, later the Liggett School, which continues today as the University Liggett 
School. Later, Courtis joined the University of Michigan as an education professor. 

The background and motivation for his tests are set out in five articles spanning 
the years 1909 to 1911 [113-117]. Subsequent references in this section are all to yet 
another publication, covering the interval 1911 to 1912 which is the interval Courtis 
provides for in his report [70]. Courtis constructed all the tests [70, pp. 402-414], 
supervised the testing and data analysis, and wrote the portion of the final report 
on testing. It contains more than 150 detailed pages with 50 hand-drawn figures 
or graphs and 55 tables some spanning multiple pages, with seemingly countless 
computations, mostly averages. It is a remarkable effort given the limited technology 
of the period. Interestingly, Courtis declined compensation and received only his 
expenses [112, p. 58]. 

Testing involved 33,350 children. The target grades were fourth to eighth, with 
the addition of two high schools, so data are available through Grade 12. Testing 
occurred in 903 classes in 52 schools in all five New York city boroughs. 
The tests covered basic operations and arithmetic reasoning. Collectively, they 
were difficult tests. Tests 1 to 4 assessed addition, subtraction, multiplication, and 
division, respectively, each having 120 items. Test 5 required copying 240 numbers; 
Test 6 with 16 items and Test 8 with 8 items were word reasoning tests. Test 7, called 
fundamentals, presented a mixture of arithmetic items requiring all four operations, 
some very difficult. One item required adding eight five-digit numbers. Children 
were allowed one minute on Tests 1 to 6, certainly speed tests. The test forms were 
labeled as speed tests. 

Discussion will focus on those tests which, to varying degrees, clearly address 
sex differences in performance. These are Tests 1, 3, 6, 7, and 8. V can be 
constructed from data provided by Courtis for one grade level for Test 6 and an 
aggregate of more than 26,000 children for Test 8. Tests 6 and 8 can be 
used for analysis, and the others cannot, because they had limited numbers of items, 
for which exact success frequencies were given. Data for the other tests required 
grouping the scores into intervals, and in addition, for some intervals data were not 
available. Testing involved 18 grade levels, 4A and 4B to 12A and 12B, and at each 
grade level from 1100 to 1200 boys and a similar number of girls were tested. 
However, grade level data for both sexes are given only sparingly. One can trust 
the averages reported, because Courtis writes that 
“... mechanical tabulation made the average ... easily computed [70, p. 432].” AD, 
average deviations, approximations which replaced standard deviations, followed 
Thorndike [64]. 


5.1.6.1  Courtis Tests 1 and 3: Addition and Multiplication 


Each test has 120 pairs of single digit numbers, for example, (0, 8) or (2, 5), which 
were vertically arrayed with answer line below. Items were arranged with six rows 
of 20 items in each row. For Test 1 children added; for Test 3 they multiplied. Data 
indicate that in mean, girls generally well exceed boys for both tests at least for the 
grade levels reported. Data are not reported for all grades and all tests. Data reported 
are based on the number of items correct for each child. For Test 1, grade 5A girls 
averaged 49.1, while for boys 47.6 [70, p. 529]. The data are far more convincing 
for Test 3, multiplication, where averages are given for each sex for eighteen grade 
levels with means for boys and girls at each grade. In all cases, except for 12B, girls 
exceed boys in mean [70, p. 525], a fact noted earlier above. At no grade level for 
Test 3 did the mean exceed 52, less than half the number of items on the test, an 
indication of the test's difficulty. 


5.1.6.2  Courtis Test 6: Simple Reasoning 


The first item of this 16 item test reads “The children of a school gave a sleigh-ride 
party. There were 9 sleighs used, and each sleigh held 30 children. How many 
children were there in the party? [70, p. 408].” As noted earlier, Lincoln [65, p. 58] 
shows girls exceed boys at two grades in Test 6 mean, they are tied at three grades, 
and boys exceed girls at 13 grade levels. A graph [70, p. 527] with mean correct 
graphed against grade level, from grade 3.5 to grade 12, shows a slightly different Test 6 
result, with boys exceeding girls at all grades except grade 11. 

Courtis provides frequency distributions of scores for Test 6 for grade 7B [70, p. 
526] and grade 8A [70, p. 446]; the 8A distribution shows boys with higher mean 
scores but fails mo. For the 7B data which satisfy mo, 


$V = \{3.714,\ 3.090,\ 1.935,\ 1.823,\ 1235,\ 1168\}$. 

Clearly, the test was difficult, with averages less than 4 for a 16 item test. 


$\hat{V} = \{3.714,\ 3.090,\ 1.936,\ 1.822\}$ 


$\hat{q} = 0.487$, 
$\hat{q}^2 = 0.231$, 
$\hat{\mu}_1 = 2.497$, 
$\hat{\mu}_2 = 4.995$, 
$\hat{\sigma}^2 = 2.189$, 
$\hat{h}_b^2 = 0.416$, and 
$\hat{h}_g^2 = 0.341$. 


The estimate of $q$ is quite large and is similar to $\hat{q} = 0.500$ for Stone's reasoning 
test. 


Figures 5.4 and 5.5 display the histograms for boys and girls, respectively. The 
distributions reveal substantial probability near zero with a sharp bound there. 
Normal distributions are clearly unsatisfactory as latent distributions because of this 
sharp lower bound, and hence no latent distributions appear. Both graphs display the 
location of the estimated component means. 
Fig. 5.4 Courtis Reasoning Test 6 histogram for 1,235 boys. Estimated locations of component 
means $\hat{\mu}_1$ and $\hat{\mu}_2$ are shown 

Fig. 5.5 Courtis Reasoning Test 6 histogram for 1,168 girls. Estimated locations of component 
means $\hat{\mu}_1$ and $\hat{\mu}_2$ are shown 


5.1.6.3  Courtis Test 7: Fundamentals 


This 14 item test [70, p. 410] resembles closely Stone's fundamental test. Test 7 
requires mostly multiplication and long division, along with some addition and 
subtraction problems. The test appears difficult, although children were given 12 
minutes. For example, a division item was $3127102 \div 463$, which, by the way, equals 
6754. Courtis reports that for grade 8B girls’ average correct was 11.1, while boys’ 
average was 10.5 [70, p. 537]. 


5.1.6.4 Courtis Test 8: Reasoning 


Children were allowed six minutes on this eight item word problem test. The first 
item is "A party of children went from a school to a woods to gather nuts. The 
number found was but 205, so they bought 1955 nuts more from a farmer. The nuts 
were shared equally by the children and each received 45. How many children were 
in the party? [70, p. 414].” 

Test scores were integer valued, with possible scores from zero to eight correct. 
No child, among 27,171 4A through 8B children, correctly answered all eight items. 
Reported are the corresponding frequencies associated with zero to seven correct 
[70, p. 531]. 


$V = \{0.907, 0.773, 1.069, 0.991, 13629, 13542\}$.
mdo holds; following the analysis,


$\hat{V} = \{0.907, 0.773, 1.070, 0.991\}$,
values which essentially match the observed values. The estimated parameter values
are


$\hat{q} = 0.110$,
$\hat{q}^2 = 0.012$,
$\hat{\mu}_1 = 0.756$,
$\hat{\mu}_2 = 2.132$,
$\hat{\sigma}^2 = 0.960$,
$\hat{h}_b^2 = 0.162$, and
$\hat{h}_g^2 = 0.023$.
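
The two heritability figures can be reproduced from the other estimates if one assumes that the X-linked heritability is the share of a group's test score variance that is due to the separation of the two components. That definition is inferred here from the reported numbers rather than quoted from the text, and `xlinked_heritability` is a hypothetical helper name.

```python
def xlinked_heritability(weight, mu1, mu2, sigma2):
    """Between-component share of one group's total variance.

    Assumption: `weight` is the mixing weight of the higher scoring
    component (q for boys, q**2 for girls in the math case), and the
    two components share the variance sigma2.
    """
    between = weight * (1.0 - weight) * (mu2 - mu1) ** 2
    return between / (sigma2 + between)


# Courtis Test 8 parameter estimates as reported above.
q, mu1, mu2, sigma2 = 0.110, 0.756, 2.132, 0.960
h2_boys = xlinked_heritability(q, mu1, mu2, sigma2)
h2_girls = xlinked_heritability(q * q, mu1, mu2, sigma2)
print(round(h2_boys, 3), round(h2_girls, 3))
```

With these inputs the sketch gives values that round to the reported 0.162 for boys and 0.023 for girls.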


In Figs. 5.6 and 5.7 are histograms for boys and girls, respectively. The modal 
number correct is zero. Because it is unclear what might be plausible latent 
discrete component distributions, no estimated distributions are given, but estimated 
component mean locations are shown.

Fig. 5.6 Courtis Reasoning Test 8 histogram for 13,629 boys. Estimated component locations $\hat{\mu}_1$
and $\hat{\mu}_2$ are shown

Fig. 5.7 Courtis Reasoning Test 8 histogram for 13,542 girls. Estimated component locations $\hat{\mu}_1$
and $\hat{\mu}_2$ are shown


5.1.6.5 Courtis Test Summary


“Differences between the abilities of boys and girls there undoubtedly are... [70,
p. 524].” The girls exceed boys in multiplication and fall below them in accuracy
of work in reasoning. “...the distribution of the individual scores in Tests 3 and
6, which represent the extremes in the amount of difference observed [70, p. 525].”
Test 3 is the multiplication test, while Test 6 is the 16 item reasoning test. This result
was noted by Lincoln [65, pp. 57–58] and reported in Chap. 2.

Although this work is well more than 100 years old and apparently largely 
unknown, the care, scope, and effort even by today's standards are impressive. In 
all cases, children were required to produce an answer, so the role of guessing in 
machine scored multiple choice formats common today is much less of an issue. 
There was no informed consent back then, and thus the issue of consent's impact on 
sampling did not occur. One might also imagine that because large-scale testing was 
not as common earlier on, children would be more attentive and motivated. Courtis's
emphasis on test speed was by design. “But by putting the work on a speed basis
a situation is created that is much more uniform from individual to individual and
from grade to grade... [70, p. 401].” Apparently Courtis directed everything. Today
any such effort would have much more dispersed responsibility. One index of the
care Courtis exercised is that in recomputing findings from tabled data, only small
errors, likely attributable to rounding, have been discovered.

It is worth noting that only a theory would seem to motivate any consideration 
of why Courtis's data would be interesting to explore from a finite mixture model 
perspective. Furthermore, only a model that has mixture component distributions 
unspecified would allow data forming the empirical distributions such as in Figs. 5.6 
and 5.7 to be sensibly analyzed. 


5.1.7 Example 7: The NAEP Math Tests 


The NAEP math tests capture five areas of math content: number properties and
operations; measurement; geometry; data analysis, associated statistics, and some
probability; and algebra, each with varying emphasis at the three grade levels.
Millions of children have been tested over the years. In 2019, the test was given
to approximately 149,000 grade 4, 147,000 grade 8, and 25,400 grade 12 students
[30]. The NAEP data used for estimation are in Tables 2.1 and 2.3. Those years for
which V satisfies mo can provide parameter estimates. Of the 31 possible V, only
three eighth grade years fail to satisfy mo. Thus, for grades 4, 8, and 12, there are
13, 10, and 5 estimable V, respectively. The averages of the parameter estimates,
with overbars, for each grade are reported below. In parentheses are the standard
deviations of the estimates (not standard errors).


5.1.7.1 NAEP Grade Four Math Tests


All 13 V satisfied mdo, with $(s_b^2 - s_g^2)/(\bar{x}_b - \bar{x}_g) = 19$ in all cases. The averages
of the estimates, with the standard deviations of these estimates in parentheses, are
given below:


$\bar{q} = 0.044\,(0.040)$,
$\overline{q^2} = 0.003\,(0.007)$,
$\bar{\mu}_1 = 232.576\,(9.505)$,
$\bar{\mu}_2 = 302.719\,(33.753)$,
$\bar{\sigma}^2 = 850.859\,(68.895)$,
$\bar{h}_b^2 = 0.118\,(0.037)$, and
$\bar{h}_g^2 = 0.006\,(0.005)$.


5.1.7.2 NAEP Grade Eight Math Tests 


Of the 13 V, 10 satisfy mdo. The averages and standard deviations of the estimates
are given below:


$\bar{q} = 0.020\,(0.016)$,
$\overline{q^2} = 0.001\,(0.001)$,
$\bar{\mu}_1 = 276.863\,(7.150)$,
$\bar{\mu}_2 = 391.887\,(58.108)$,
$\bar{\sigma}^2 = 1265.758\,(78.101)$,
$\bar{h}_b^2 = 0.105\,(0.023)$, and
$\bar{h}_g^2 = 0.002\,(0.001)$.


5.1.7.3 NAEP Grade Twelve Math Tests 


All five V satisfy mdo. Similarly to the above,


$\bar{q} = 0.047\,(0.022)$,
$\overline{q^2} = 0.003\,(0.002)$,
$\bar{\mu}_1 = 150.254\,(1.475)$,
$\bar{\mu}_2 = 226.013\,(36.121)$,
$\bar{\sigma}^2 = 1054.533\,(58.070)$,
$\bar{h}_b^2 = 0.158\,(0.039)$, and
$\bar{h}_g^2 = 0.008\,(0.002)$.


5.1.7.4 NAEP Math Tests Summary 


The NAEP math test scores in Tables 2.1 and 2.3 are the best estimates of math 
performance for the general U.S. population of children at three grade levels.
Estimates of q are small, with $q^2$ vanishingly small for girls. The mean estimates
of q at grades four and twelve are nearly identical. The mean X-linked heritabilities
$h_b^2$ are small and $h_g^2$ tiny. Yet the estimates account for both the mean and standard
deviation differences between boys and girls in these data. Although not given, $\hat{V}$
closely match V in all cases. Consequently, there appear few important graphical
differences among the different ages or grades. The graph of the 2019 grade 12 math
solution is displayed below, in Fig. 6.1, to illustrate a related issue.


5.1.8 Example 8: 2003 PISA Math Tests 


Table 2.5 provides PISA math V for 41 countries. For five countries mo failed, and
these countries appear in bold font. Denmark, Indonesia, and the Netherlands all
failed mo because the standard deviations failed to align. In Iceland and Thailand,
the mean for girls exceeds the mean for boys. Below are the means of the estimates
for the 36 countries with V satisfying mdo, with the corresponding standard
deviations of these estimates in parentheses:


$\bar{q} = 0.135\,(0.130)$,
$\overline{q^2} = 0.034\,(0.056)$,
$\bar{\mu}_1 = 479.315\,(50.014)$,
$\bar{\mu}_2 = 699.421\,(274.231)$,
$\bar{\sigma}^2 = 7735.391\,(1137.271)$,
$\bar{h}_b^2 = 0.181\,(0.067)$, and
$\bar{h}_g^2 = 0.035\,(0.044)$.


Collectively, the influence of X-linkage appears to be greater in these PISA math
data than in the U.S. NAEP math data. Here, $\bar{q} = 0.135$. But there is considerable
country-to-country variability in $\hat{q}$. Table 5.3 displays in column two the estimates $\hat{q}$
for math (denoted $\hat{q}_m$) for each country and in columns four and five the estimated
heritabilities for boys and girls for math. It will be seen later on that the variability of
these $\hat{q}$ seems quite plausible in light of recent paleogenetic studies which upend
long held beliefs about the distribution of genes in Europe [118, 119].

Furthermore, the variability of $\hat{q}$ has meaningful associations with different
countries’ Global Gender Gap Index as will be considered later in Chap. 6. And
this diversity, reflected in the variability of $\hat{q}$, might well be expected: “Patterns of
genetic diversity have previously been shown to mirror geography on a global scale
and within continents and individual countries [120].”

There appear corresponding similarities of PISA math $\hat{q}$ estimates in Table 5.3
related to geography. This matter will be revisited later, but the Czech
Republic with $\hat{q} = 0.302$ and Slovakia with $\hat{q} = 0.351$ have similarly large $\hat{q}$ and
were a single country until 1993. Finland with $\hat{q} = 0.040$ and Sweden with $\hat{q} = 0.052$,
with far smaller but similar $\hat{q}$ estimates, were also a single country until 1809. It is thought
these two countries were genetically similar to Norway, with $\hat{q} = 0.026$ [120].

Table 5.3 PISA Math and Reading Parameter Estimates

Country             $\hat{q}_m$  $\hat{q}_r$  $\hat{h}^2_{b,m}$  $\hat{h}^2_{g,m}$  $\hat{h}^2_{b,r}$  $\hat{h}^2_{g,r}$
Australia           0.019        0.461        0.157              0.004              0.617              0.521
Austria             0.037        0.506        0.172              0.008              0.802              0.756
Belgium             0.027        0.410        0.166              0.005              0.436              0.309
Brazil              0.143        0.374        0.199              0.039              0.382              0.241
Canada              0.076        0.404        0.208              0.021              0.480              0.343
Czech Republic      0.302        0.568        0.112              0.047              0.439              0.411
Denmark             NA           0.465        NA                 NA                 0.322              0.244
Finland             0.040        NA           0.187              0.009              NA                 NA
France              0.050        0.470        0.167              0.019              0.586              0.494
Germany             0.068        0.486        0.115              0.009              0.572              0.491
Greece              0.202        0.367        0.243              0.072              0.498              0.333
Hong Kong           0.005        0.327        0.263              0.002              0.549              0.346
Hungary             0.075        0.511        0.096              0.009              0.444              0.382
Iceland             NA           NA           NA                 NA                 NA                 NA
Indonesia           NA           0.614        NA                 NA                 0.428              0.426
Ireland             0.367        0.534        0.127              0.068              0.446              0.398
Italy               0.144        0.417        0.253              0.053              0.573              0.442
Japan               0.028        0.197        0.229              0.008              0.259              0.076
Korea               0.451        0.462        0.254              0.182              0.261              0.193
Latvia              0.006        0.482        0.160              0.001              0.692              0.616
Liechtenstein       0.247        0.213        0.401              0.171              0.208              0.064
Luxembourg          0.227        0.402        0.187              0.060              0.425              0.294
Macau-China         0.243        0.225        0.295              0.112              0.210              0.068
Mexico              0.193        0.456        0.101              0.025              0.201              0.143
Netherlands         NA           0.477        NA                 NA                 0.239              0.181
New Zealand         0.148        0.394        0.162              0.032              0.278              0.175
Norway              0.026        0.506        0.167              0.005              0.877              0.845
Poland              0.016        0.448        0.218              0.005              0.637              0.532
Portugal            0.075        0.414        0.250              0.026              0.576              0.443
Russian Federation  0.065        0.329        0.181              0.015              0.386              0.216
Serbia              0.001        NA           0.233              0.000              NA                 NA
Slovakia            0.351        0.552        0.170              0.088              0.498              0.460
Spain               0.056        0.464        0.174              0.012              0.632              0.539
Sweden              0.052        0.541        0.093              0.006              0.587              0.543
Switzerland         0.259        0.506        0.143              0.052              0.545              0.477
Thailand            NA           NA           NA                 NA                 NA                 NA
Tunisia             0.476        0.599        0.088              0.063              0.295              0.286
Turkey              0.106        0.388        0.204              0.029              0.479              0.331
United Kingdom      0.087        0.464        0.064              0.006              0.357              0.273
United States       0.025        0.431        0.161              0.005              0.388              0.281
Uruguay             0.145        0.413        0.113              0.021              0.405              0.284

Note: For math, NA denotes mo failure. For reading, NA denotes negative $\hat{\sigma}^2$. Subscripts m and
r denote math and reading, respectively

In addition, Finland has been an independent country only since 1917 and was
previously largely controlled by the Russian Empire. The Russian Federation with $\hat{q} = 0.065$,
given in Table 5.3, seems roughly similar to $\hat{q} = 0.040$ for Finland. Russia and
Finland also share a 1340 kilometer border. Geography could be one important
reason for similarities of math $\hat{q}$ among some countries.

The estimates of q do, at least in some cases, roughly align with the observed
real-world frequencies of behavioral concern. Ceci and Williams observe that “In
Turkey, men were overrepresented among computer science graduates by a factor
of only 1.79 to 1, while in the Czech Republic, they were overrepresented... by
a factor of 6.42 to 1. In the United States the ‘male overrepresentation factor’ is
2.10 to 1 and in the United Kingdom, 3.10 to 1 [5, p. 168].” The ordering of these
countries based on their $\hat{q}$, given in parentheses, provides a possible explanation as
to why these factors differ:


The Czech Republic (0.302) > Turkey (0.106) > U.K. (0.087) > U.S. (0.025).


Only Turkey seems wrongly ordered, given the quote above. Given the interest in
the overrepresentation of men in computer science, it is interesting to compare the
graphically portrayed solutions of the Czech Republic and the U.S.A.

Figure 5.8 displays the Czech Republic, while Fig. 5.9 displays the U.S.A. The
lower component estimate $\hat{\mu}_1$ is similar in location for the two countries, but
for the U.S.A. $\hat{\mu}_2$ is more extremely situated and the upper component is much
smaller in size. With the scale of Fig. 5.9, the girls’ higher scoring component does
not appear in the U.S. graph. In Fig. 5.8 for the Czech Republic, the higher scoring
component is much larger for both boys and girls. The graphical comparisons of the
two countries, and the ratio of the Czech boys' higher scoring
component weight $\hat{q}$ to the corresponding U.S. $\hat{q}$, 0.302/0.025 = 12.08, would
seem to give some indication as to why the computer science graduates in the Czech
Republic are far more overrepresented by men than in the U.S.A.

Fig. 5.8 The Czech Republic PISA Math Test Estimated Solution from V in Table 2.5

Fig. 5.9 United States PISA Math Test Estimated Solution from V in Table 2.5

Also, there appears to be evidence of some agreement in the sizes of $\hat{q}$ with
estimates from other sources. The U.S. $\hat{q} = 0.025$ is within the range of the
corresponding average estimates from NAEP data in Example 7, 0.020 to 0.047.
The U.S. estimate is also the fifth smallest value among the 36 $\hat{q}$ estimates. So, the
U.S. $\hat{q}$ is not large in comparison to mostly European countries. Another example is
Italy. From Example 3, $\hat{q} = 0.178$ for Italian children. The PISA estimate for Italy
is $\hat{q} = 0.144$.

Just how different are the boys' and girls' math test score distributions in different 
countries? And just how different are the U.S. boys' and girls' distributions relative 
to many other countries? The answer is that U.S. boys' and girls' distributions are
far more similar than those of most of the OECD countries. Estimates of the probability
distributions $f_b(y)$ and $f_g(y)$ are easily obtained for each country for which mo
is satisfied, by replacing that country's parameters with their estimates in these
distributions, giving $\hat{f}_b(y)$ and $\hat{f}_g(y)$. Once available, an estimate
of the Hellinger distance, H, defined above, is most easily obtained using numerical
integration.
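
The numerical integration step can be sketched as follows. Several things here are assumptions rather than statements from the text: the components are taken to be normal with a common variance, the Hellinger distance is taken in the common normalization where its square equals one minus the integral of the square root of the product of the two densities (the definition given earlier in the book may differ by a constant factor), and the inputs are simply the 36-country PISA math averages quoted above, used as illustrative values that are not any single country's solution. All function names are hypothetical.

```python
import math


def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, w_high, mu1, mu2, sigma):
    """Two-component normal mixture; w_high is the weight on mean mu2."""
    return ((1.0 - w_high) * normal_pdf(y, mu1, sigma)
            + w_high * normal_pdf(y, mu2, sigma))


def hellinger(f, g, lo, hi, n=20000):
    """H = sqrt(1 - integral of sqrt(f * g)), using the trapezoidal rule.

    Assumption: this normalization, which bounds H between 0 and 1.
    """
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * step
        end_weight = 0.5 if i in (0, n) else 1.0
        total += end_weight * math.sqrt(f(y) * g(y))
    affinity = min(1.0, total * step)  # guard against rounding above 1
    return math.sqrt(1.0 - affinity)


# Illustrative inputs only: averages of the 36 countries' math estimates.
q, mu1, mu2, sigma = 0.135, 479.315, 699.421, math.sqrt(7735.391)


def f_boys(y):
    return mixture_pdf(y, q, mu1, mu2, sigma)


def f_girls(y):
    return mixture_pdf(y, q * q, mu1, mu2, sigma)


# Integrate well beyond both component means so truncation is negligible.
lo, hi = mu1 - 8.0 * sigma, mu2 + 8.0 * sigma
hd = 100.0 * hellinger(f_boys, f_girls, lo, hi)
print(round(hd, 3))
```

Because the inputs are averages of estimates, the printed value illustrates the computation only and should not be compared with any row of a country-level table.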

Table 5.4 provides, in column two, $Hd = 100\hat{H}$ in math for all 41 countries.
For those countries for which mo failures occurred, Hd based on normality are given
and appear underlined. When Hd is computed under normality, the estimates tend
to be somewhat smaller than Hd under Y. There is a sizable range of Hd differences
among countries. The U.S. estimated solution displayed in Fig. 5.9 has Hd = 5.833.

Macau-China is the second largest among the 41 Hd with Hd = 9.956, and its
corresponding $\hat{q} = 0.243$ is nearly ten times the size of $\hat{q} = 0.025$ for the U.S.A.,
with estimates given in Table 5.3. Macau-China's graphical solution is given in
Fig. 5.10, which shows dramatically the much larger sex differences favoring boys
in PISA upper tail math performance than for the U.S.A. Comparing math sex
differences in the U.S.A. with other countries reveals that the U.S.A. displays among
the smaller sex differences in math test performance as gauged by Hd. It is twelfth
smallest among those 36 countries with model-based Hd estimates (not underlined)
in Table 5.4. As will be seen, much the same situation holds for reading as well
except that girls dominate boys.

Table 5.4 PISA Hd and Global Gender Gap Index

Country             $Hd_m$    $Hd_r$    GGGI
Australia           6.075     15.645    0.738
Austria             6.020     17.744    0.781
Belgium             6.874     12.873    0.793
Brazil              9.479     13.523    0.696
Canada              7.027     13.630    0.772
Czech Republic      5.808     12.145    0.710
Denmark             _6.441_   10.624    0.764
Finland             6.419     _20.562_  0.860
France              5.810     15.096    0.791
Germany             4.482     14.787    0.801
Greece              8.618     14.108    0.689
Hong Kong           5.291     15.048    NA
Hungary             3.858     12.573    0.699
Iceland             _8.046_   _22.506_  0.908
Indonesia           _2.031_   11.650    0.697
Ireland             6.341     12.476    0.804
Italy               8.505     15.189    0.720
Japan               8.184     9.069     0.650
Korea               9.342     9.458     0.689
Latvia              5.209     16.613    0.771
Liechtenstein       12.294    7.835     NA
Luxembourg          7.402     12.670    0.736
Macau-China         9.956     7.933     NA
Mexico              5.252     8.687     0.764
Netherlands         _1.956_   8.974     0.767
New Zealand         6.347     9.902     0.841
Norway              5.991     18.281    0.845
Poland              7.156     16.046    0.709
Portugal            8.011     15.239    0.766
Russian Federation  6.198     12.070    NA
Serbia              1.993     _20.117_  0.779
Slovakia            7.422     13.173    0.717
Spain               5.958     15.866    0.788
Sweden              3.611     14.606    0.822
Switzerland         6.512     14.217    0.795
Thailand            _0.030_   _20.177_  0.709
Tunisia             5.936     10.303    0.643
Turkey              7.453     13.695    0.639
United Kingdom      3.010     11.263    0.780
United States       5.833     11.922    0.769
Uruguay             5.405     13.501    0.711

Note: $Hd_m$ is math, and $Hd_r$ is reading. Underlines are estimates under normality.
NA = Not Available

Fig. 5.10 Macau-China PISA Math Test estimated solution from V in Table 2.5


5.2 Reading Examples 


The reading V and their corresponding model estimates may be viewed just as
the math estimates were viewed. The only fundamental data difference is that
$\bar{x}_g > \bar{x}_b$ and that the associated figures show boys anchoring the bottom of the reading
test distributions.


5.2.1 Example 9: British Reading Test 


Children aged 10 years completed the British reading test called Group Reading 
Test II [50]. 


$V = \{97.50, 100.96, 13.20, 11.58, 117, 115\}$,
which satisfies rdo and with
$\hat{V} = \{97.50, 101.00, 13.41, 11.33\}$.


Below are the parameter estimates, with standard errors in parentheses:


$\hat{q} = 0.218\,(0.16)$,
$\hat{q}^2 = 0.048\,(0.09)$,
$\hat{\mu}_1 = 81.63\,(23)$,
$\hat{\mu}_2 = 101.92\,(49)$,


$\hat{h}_b^2 = 0.390\,(0.10)$, and
$\hat{h}_g^2 = 0.145\,(0.08)$.


The estimated higher scoring component, $\hat{f}_2(y)$, has weights $1 - \hat{q} = 0.782$ for
boys and $1 - \hat{q}^2 = 0.952$ for girls. The estimated $\hat{V}$ agrees well with V.

Fig. 5.11 Example 9. Thick lines are $\hat{f}_b(y)$ for boys and $\hat{f}_g(y)$ for girls, and thin lines are their two
components $\hat{f}_j(y)$ with ordinates at $\hat{\mu}_1$ and $\hat{\mu}_2$ at the estimated means of the two-component
densities

Figure 5.11 graphs the estimated solution assuming normal components and
displays the estimated component means $\hat{\mu}_1$ and $\hat{\mu}_2$. Notable is the very large excess
of boys, relative to girls, in the lower tails of their associated distributions. The figure
clearly reveals why the reading sample mean for boys is smaller, but sample variance
for boys is larger; their upper tails show more similarities. There are more girls than
boys in the upper tail regions, although the sizes of the differences are smaller,
vanishing as test scores increase.


5.2.2 Example 10: The NAEP Reading Tests 


As with the math tests, millions of children have been tested over time. The 
reading test resembles in spirit many earlier reading tests. There are paragraphs 
to read which are followed with questions which target three cognitive categories: 
locate/recall, integrate/interpret, and critique/evaluate, questions that presumably 
reflect the kinds of thinking required to understand written text. The V data
employed are in Tables 2.2 and 2.4. There are 35 reading V; five V fail to satisfy ro,
all with standard deviations for boys and girls matching. All 30 V which satisfy ro
also satisfy rdo. The mean of the estimates for each grade along with the standard
deviation of the estimates in parentheses is given below.


5.2.2.1 NAEP Fourth Grade Reading Tests 


The model estimates of 11 V with standard deviations of the estimates in parentheses
are given below:


$\bar{q} = 0.326\,(0.083)$,
$\overline{q^2} = 0.112\,(0.055)$,
$\bar{\mu}_1 = 192.320\,(6.651)$,
$\bar{\mu}_2 = 227.152\,(1.733)$,
$\bar{\sigma}^2 = 1200.524\,(90.431)$,
$\bar{h}_b^2 = 0.178\,(0.058)$, and
$\bar{h}_g^2 = 0.091\,(0.054)$.


5.2.2.2 NAEP Eighth Grade Reading Tests 


As above, the estimates of the 12 V and their standard deviations in parentheses are
given below:


$\bar{q} = 0.507\,(0.044)$,
$\overline{q^2} = 0.259\,(0.044)$,
$\bar{\mu}_1 = 236.431\,(6.143)$,
$\bar{\mu}_2 = 280.423\,(2.884)$,
$\bar{\sigma}^2 = 777.853\,(160.755)$,
$\bar{h}_b^2 = 0.385\,(0.136)$, and
$\bar{h}_g^2 = 0.329\,(0.141)$.


5.2.2.3 NAEP Twelfth Grade Reading Tests 


With seven V, the estimates and corresponding standard deviations in parentheses
are given below:


$\bar{q} = 0.474\,(0.098)$,
$\overline{q^2} = 0.233\,(0.091)$,
$\bar{\mu}_1 = 255.198\,(2.288)$,
$\bar{\mu}_2 = 307.203\,(8.360)$,
$\bar{\sigma}^2 = 775.978\,(408.644)$,
$\bar{h}_b^2 = 0.488\,(0.239)$, and
$\bar{h}_g^2 = 0.417\,(0.280)$.


5.2.2.4 NAEP Reading Test Analysis Summary 


As with math, although not displayed, $\hat{V}$ closely resemble V in all cases, in many
cases matching V, as do all twelfth grade $\hat{V}$. The estimates of q for reading are
much larger than estimates of q for math. The much larger mean separation favoring
girls in reading reflects the much larger $\hat{q}$ estimates. Heritabilities are larger as well,
with average X-linked heritabilities increasing with age, for both boys and girls, a
commonly observed finding with polygene heritability estimates.

Figure 5.12 shows the graphic for the 2015 grade 12 NAEP reading V in
Table 2.4. As with the previous example, it reflects the often observed commentary
in the literature, namely that there are more girls than boys in the upper tail
of the reading test score distributions, but more boys than girls in
the lower tail regions, as has been reported [121, 122].

Fig. 5.12 NAEP 2015 grade 12 reading estimated solution, V from Table 2.4. Thick lines are
$\hat{f}_b(y)$ for boys and $\hat{f}_g(y)$ for girls, and thin lines are their two components $\hat{f}_j(y)$ with ordinates at $\hat{\mu}_1$
and $\hat{\mu}_2$ at the estimated means of the two component densities

Focusing on Fig. 5.12, the probability area between the solid and dashed bold
lines can be estimated under Y for intervals of interest. The equations of concern are
all defined in Chap. 4. For example, $\int_{300}^{\infty} [\hat{f}_g(y) - \hat{f}_b(y)]\,dy = 0.091$. So, there are
about 9% more girls in this upper tail region above 300 than there are boys. The solid
and dashed bold lines cross in Fig. 5.12 at reading score 279. The area difference
between the corresponding boys' and girls' lower tails is, up to the point where the
dashed and solid bold lines cross, $\int_{-\infty}^{279} [\hat{f}_b(y) - \hat{f}_g(y)]\,dy = 0.106$, reflecting the
obvious excess of boys relative to girls in the lower tail regions, here about 11%.
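
If the components are taken to be normal with a common variance, as drawn in the figures, tail areas of this kind reduce to differences of normal distribution functions. For reading, the boys' density minus the girls' density equals q(1 - q) times the lower component density minus the higher component density, so the two bold curves cross midway between the component means. The sketch below rests on those assumptions. The 2015 grade 12 solution behind Fig. 5.12 is not tabled in this section, so the grade twelve averages reported above serve as illustrative inputs only and will not reproduce 0.091 or 0.106. The name `tail_gap` is hypothetical.

```python
import math


def normal_cdf(y, mu, sigma):
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def tail_gap(c, q, mu1, mu2, sigma):
    """Area between the boys' and girls' reading densities split at score c.

    Assumption (reading case): boys put weight q, girls q**2, on the lower
    component. The excess of boys below c and the excess of girls above c
    are then the same number, because both densities integrate to one.
    """
    return q * (1.0 - q) * (normal_cdf(c, mu1, sigma) - normal_cdf(c, mu2, sigma))


# Illustrative inputs only: NAEP grade twelve reading averages from above.
q, mu1, mu2, sigma = 0.474, 255.198, 307.203, math.sqrt(775.978)

crossing = 0.5 * (mu1 + mu2)  # where the two bold curves cross
lower_gap = tail_gap(crossing, q, mu1, mu2, sigma)  # excess boys below crossing
upper_gap = tail_gap(300.0, q, mu1, mu2, sigma)     # excess girls above 300
print(round(crossing, 1), round(lower_gap, 3), round(upper_gap, 3))
```

The gap is largest when the split is placed at the crossing point, which is why the lower tail figure quoted in the text exceeds the figure for the region above 300.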


5.2.3 Example 11: 2003 PISA Reading Tests 


Table 2.6 provides the PISA reading V for 41 countries. Compared with the different
countries' mean math sex differences, the reading mean differences are much larger.
The reading differences $\bar{x}_g - \bar{x}_b$ range from 13.27 to 57.76. By comparison, $\bar{x}_b - \bar{x}_g$
for math range from $-15.41$ to 28.84. The average of the 39 mean math differences
favoring boys is 10.313. The average of all 41 reading differences favoring girls
is 33.682.

Although all countries well satisfied rdo, four V failed in parameter estimation:
Finland, Iceland, Serbia, and Thailand. These countries have large mean separation,
always more than 42 points. Y is the random effects model, and $\sigma^2$ is estimated by
a subtraction. Consequently, the estimate can be negative. $\hat{\sigma}^2$ was negative for these
four countries and so parameter estimation under Y failed. Parameter estimates are
available for 37 countries. Below are the means of the 37 countries’ estimates, with
the standard deviations of the estimates in parentheses:


$\bar{q} = 0.440\,(0.096)$,
$\overline{q^2} = 0.203\,(0.080)$,
$\bar{\mu}_1 = 390.983\,(44.857)$,
$\bar{\mu}_2 = 525.916\,(44.305)$,
$\bar{\sigma}^2 = 5190.130\,(2701.927)$,
$\bar{h}_b^2 = 0.460\,(0.271)$, and
$\bar{h}_g^2 = 0.364\,(0.316)$.


The sizes of these much larger sex differences in reading are attributable to
much larger estimates of q and thus $q^2$ for reading than the smaller q and often
vanishing $q^2$ estimated values for math. The reading $\hat{q}$ (denoted as $\hat{q}_r$) are given in
column three of Table 5.3, along with X-linked estimated reading heritabilities in
columns six and seven. Hd for reading are given in column three of Table 5.4. For
those countries where estimation failed, Hd estimates under normality are given and
denoted by underline. There are large differences between boys’ and girls’ reading
distributions among the countries, typically far larger than the math differences,
as may be seen by comparing columns two and three in Table 5.4. As with math, the
reading distributional differences for the U.S.A. are comparatively small, with the
U.S. distributional difference Hd = 11.922 the twelfth smallest of the 37 model-based
Hd estimates.

Pictorially, the contrast is illustrated in Figs. 5.13 and 5.14, which give the graphs
of the countries with the smallest and largest Y-based estimates Hd. Liechtenstein
has the smallest separation, with Hd = 7.835, and Norway, with Hd = 18.281, the
largest (again, among those countries without their Hd underlined in column three
of Table 5.4).

The graph for Norway might seem implausibly surreal with a clear estimated
bimodality in reading scores for both boys and girls. In the PIRLS Norway reading
data [14, Table 3] for the 2001 and 2006 assessment cycles, both V satisfy rdo and
yield graphs more like Fig. 5.11, without the clear bimodality of Fig. 5.14. However, it
is expected that boys and girls, in some countries, would reveal bimodal test performance
even in smaller samples, given the sizes of the reading test performance differences
in some countries. The 2006 PIRLS V for Qatar [14, Table 3] yields graphical
bimodalities under Y.

Fig. 5.13 Liechtenstein PISA reading solution, V data from Table 2.6. Thick lines are $\hat{f}_b(y)$ for
boys and $\hat{f}_g(y)$ for girls, and thin lines are their two components $\hat{f}_j(y)$ with ordinates at $\hat{\mu}_1$ and $\hat{\mu}_2$ at
the estimated means of the two component densities

Fig. 5.14 Norway PISA reading solution, V from Table 2.6. The distributions $\hat{f}_b(y)$ for boys and
$\hat{f}_g(y)$ for girls coincide with their component distributions $\hat{f}_1(y)$ and $\hat{f}_2(y)$ because the components
are widely separated. The standard deviation of each component is 37, while the component means
$\hat{\mu}_1$ and $\hat{\mu}_2$ are 197 points apart. Thus, the component means are more than five standard deviations
apart


It is difficult to overemphasize how much girls dominate boys in reading, 
seemingly in all developed countries worldwide. However, the database here 
considers largely OECD countries, those with PISA scores. The sizes of reading 
mean differences favoring girls simply far exceed the size by which boys typically 
exceed girls in math test mean.


Chapter 6
Summaries and Model Extensions


Further model applications, analytical extensions, and clarifications are detailed in 
12 sections after the first section which summarizes results and implications of the 
Chap. 5 analyses. 


6.1 Empirical and Conceptual Main Points 


It seems reasonable to suspect that anyone who has read this far is likely to be much 
more interested in sex differences in math than in sex differences in reading. One 
might therefore start on a lighter note by observing that perhaps the above theory 
might provide the justification for a popular book’s title: The Math Gene [123]. 
Curiously, the title is misleading. Devlin rejects any notion of a math gene. He writes 
“Roughly speaking, by ‘the math gene’ I mean ‘an innate facility for mathematical 
thought...’ [123, p. xvi].”

“There cannot be any single or simple answer to the many complex questions
about sex differences in math and science. Readers expecting a single
conclusion...are surely disappointed [12, p. 41, italics added].” Ignoring the arrogance
of the claim, certainly there are settings in which sex differences concerning math
and science are boldly evident and which may be the result of bias or discrimination
[124]. Such matters are certainly of concern. However, one of the most important,
most widely noted, and most puzzling sex differences has concerned differences in
math test score distributions obtained in observational settings. Such distributional
differences have been evident for more than a century in the U.S.A. and are evident
globally today. These differences have now been coherently explained by a model
which, at root, is little more than the model for two independent coin flips. All
the analytical model inequalities can be traced to sex differences in probabilistic
outcomes of binary events. The same model addresses the widely neglected but far
larger sex differences in reading test score distributions as well.

Thus, contrary to what has been claimed (with some exceptions to be addressed
momentarily), a simple and coherent answer can be given to empirical test score sex
difference puzzles. It is simply a matter of first, recognizing the patterns in data,
as Kagan [34] wisely stressed, and second, modeling the processes appropriately.
The answer provided by Y may not be a popular answer, nor perhaps the answer
hoped for or desired. However, Y provides a plausible answer. There exists no
alternative, formal or otherwise, which provides the conceptual basis for addressing
a wide range of empirical facts and in particular provides the theoretical basis for
the inequalities of S. The level of model to empirical correspondence, that is, the
correspondence between $\hat{V}$ and V, especially within a simple framework, seems
rare. Consequently, it seems reasonable to suggest that Y resembles the process
that generated the V sample data. Y has just three critical parameters to estimate:
$q$, $\mu_2 - \mu_1$, and $\sigma$, the same number as the effect size $\delta$. If the law of parsimony
(Occam's razor) applies, then Y would appear to set a high bar for alternative
explanations to achieve. Any competing framework must coherently account for
the inequalities of S and hope to do so with three parameters.

There are settings where Y does not hold, or at least the estimation algorithm 
fails, even if the elements of S are satisfied. In the case of PISA math testing, in 
five countries mo fails. These countries are noted in Table 2.5. In some cases, these 
failures may be because of sampling variation when q for math for a particular 
country may be small. It may be because of girls' dominant advantage in reading, 
thus facilitating girls’ performance on some PISA math test items. The language of 
the country in which the test is administered may impact boys and girls differently, 
thus influencing math test performance in ways not fully understood. And it must 
be admitted that the model may be flat-out wrong, at least for some countries. 

For reading scores, rdo seems to nearly invariably hold, and the PISA test failures
of four countries in Table 2.6, noted in Sect. 5.2.3, were of a different kind,
namely estimation failures when $\hat{\sigma}^2 < 0$. It seems likely the assumption that the
components $N_1$ and $N_2$ share the same variance is wrong. There are several cases as well where
mdo holds, but estimation fails, for example [47], and likely for the same reason.
Relaxing the equal variance assumption could be addressed within a likelihood
model framework, but estimation in this case would require a data stream, much
more conceptual machinery, and iterative methods.

While there are no competing models to Y, no claim is made that the model is
the "correct" model as all models are wrong models [125]. However, even wrong
models can be useful, and the breadth of explanatory power seems remarkable for
a simple model. The model appears to provide a coherent explanation for the widely
noted differences among countries in their PISA test scores, except for those cases 
just mentioned. Being able to estimate the parameters of Y based on elements of V 
without requiring a raw data stream and then subsequently being able to illustrate the 
model graphically can certainly be viewed as a model strength. There are numerous 
other V which could be analyzed. What has been presented is only a sampling. 

It has been noted repeatedly that the estimates of q for math are generally small 
and correspondingly so are the X-linked heritabilities for math for both boys and 
girls. This result reveals that most of the test score variance in data is unexplained, 
an outcome that was expected. X-linked heritabilities are far larger for reading, but 
substantial variance remains unaccounted for in reading test scores as well. Thus, there 
is much research required to identify other plausible and likely larger sources of 
variance. 

To untangle other sources of influence is likely to require new frameworks for 
thinking about sex differences which put empirical regularities at the center of 
attention as, to note again, Kagan [34] has eloquently argued. This simply has not 
occurred. Witness the general disregard for differences in test score variances as 
having any substantive relevance for understanding either math or reading test score 
sex differences. The viewpoint here is that effect size approaches have hijacked 
efforts at understanding and impeded if not halted conceptual progress. Meta- 
analysis approaches have not contributed to conceptual understanding, a fact that 
has long been recognized [126]. Yet the belief in the value of effect size analysis has 
grown such that it has been claimed to be a viable framework for any sex difference 
variables of interest. 

In Figure 1 [15] are displayed four panels. In each panel are two equal variance 
but shifted normal distributions; each panel implies a different effect size. The 
authors write "Figure 1 shows four possible alternatives for the distribution of 
males' and females' scores on a trait, which could be anything from hippocampus 
size to mathematics performance [15, p. 177, italics added].” Such a statement 
implies a remarkably naive belief that all between-group mean differences are 
understandable through a location-shift, effect size framework, as if this perspective 
is the only way by which Nature could produce sex differences in task performance. 

Not only is an effect size approach inappropriate for math (and reading) test 
scores, but it is inappropriate for the measures of the hippocampus as well. That is 
because, while it might be surprising to learn, mo holds for hippocampus volumes 
as well. Ritchie and others in their Table 1 [127] report the means and standard 
deviations of the left and right hippocampus volumes of 2466 men and 2750 women. 
Data for both left and right volumes satisfy mo as do all 15 additional brain variables 
in their table. 

Two cognitive test variables reporting sex differences also appear in their Table 1. 
One test satisfies mo, while the other test reports reaction times. Perhaps the 
questions should be, first, for what sex difference variables is an effect size model $\delta$ 
appropriate and thus the calculation of d sensible? Second, could X-linkage play a 
role in brain sex differences? 

Nature can produce sex differences in test score distributions, or sex differences 
in other settings, by many different vehicles. The researcher's task is to attempt 
to learn which vehicle Nature uses and to try to model that process suitably. This 
fundamental step in the research process has gone missing. 

Under Y, it is assumed Nature produces distributional differences by a frequency 
mechanism which is different from a location-shift mechanism. This is illustrated 
by all the Chap. 5 examples. Recall that under Y both boys and girls share identical 
latent component distributions, $f_k(y)$, $k = 1, 2$, so boys and girls within the same 
component share identical math test score distributions. What produces the mean 
differences (and variance differences) are the frequencies with which these scores 
appear. These frequency differences are determined by the component coefficients, 
functions of $q$ and $q^2$. Another way of thinking about differences, and as noted 
earlier, is that Y models the latent within-sex differences, and these within-sex 
differences produce the between-sex differences observed in data. By understanding 
within-sex differences, the between-sex observed differences may well take care of 
themselves [128]. 
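
The frequency mechanism can be illustrated with a minimal sketch. It assumes, consistent with the expressions used in Sect. 6.2, that boys carry weight q and girls weight q^2 on the higher scoring component, and that both components share a common standard deviation. The function name and all numerical values are hypothetical and are not estimates from any table in this book.

```python
def mixture_moments(w, mu1, mu2, sigma):
    """Mean and variance of a two-component mixture.

    w is the weight on the higher component (mean mu2); both components
    are assumed to share the standard deviation sigma.
    """
    mean = (1 - w) * mu1 + w * mu2
    var = w * (1 - w) * (mu2 - mu1) ** 2 + sigma ** 2
    return mean, var


# Hypothetical parameter values, chosen only for illustration.
q, mu1, mu2, sigma = 0.05, 500.0, 650.0, 90.0

boys_mean, boys_var = mixture_moments(q, mu1, mu2, sigma)         # weight q
girls_mean, girls_var = mixture_moments(q ** 2, mu1, mu2, sigma)  # weight q^2

print(f"boys:  mean={boys_mean:.2f}, variance={boys_var:.2f}")
print(f"girls: mean={girls_mean:.2f}, variance={girls_var:.2f}")
```

The component distributions are identical for the two sexes in this sketch; only the weights differ, yet both the mean and the variance come out larger for boys.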

“This book is about the reasons males are overrepresented in mathematics and 
mathematically intensive professions. . . [5, p. ix].” This was the motivation of Ceci 
and Williams, 2010. Their motivation was clearly a frequency matter, not a location- 
shift matter: why are there more men than women in some professions? Yet in 
"Gauging the size of sex differences," they use d [5, pp. 20-21] which does not 
index frequency differences. This does not imply of course that a location-shift 
perspective cannot often lead, plausibly, to observed frequency differences. But 
given that interest in sex differences often starts with everyday observations of 
frequency differences of boys and girls or men and women in various settings 
[23], it does suggest that one needs a broader perspective in how sex differences 
are conceptualized. Although it may seem heretical to suggest it, abandoning the 
devotion to effect sizes may be a good place to start. There is much to admire about 
the openness and scholarly approach of the now rarely referenced, nearly 50-year-old 
Maccoby and Jacklin book [53]. 

Recently, Casey and Ganley [20] have expressed interest in considering within-sex 
differences. Latent processes, often viewed as latent distributions, appear to 
have attracted little interest among sex differences researchers. The main reason is 
likely because of the dominance of effect size procedures which leave no space for 
latent variables thinking achievable through mixture models [99, 129]. If between- 
group differences are the result of within-group component distributions with 
different component weights, as in Y, then no between-group location-shift model 
is appropriate. That is because each group's probability distribution is of a different 
shape. Under a location-shift perspective, the distributions of different groups 
remain identical in shape. In real-world realizations, however, there may appear 
to be no sex differences in graphical portrayals as Fig. 5.2 of Example 4 reveals. 

One conceptual fact, described in Example 3 and illustrated in Fig. 5.1, needs to 
be emphasized: for both sexes, the majority of high math performers come from 
the lower, not the higher, scoring latent component. This is true, 
for example, for the U.S. PISA math data, portrayed in Fig. 5.9. Only those boys 
with scores above 667 more probably come from the higher component. Virtually 
all higher scoring girls come from the lower scoring component. The reason is 
because both for boys and girls the proportions of individuals in the higher scoring 
component are small, so most of the probability mass is in the lower scoring PISA 
math component. For boys the higher scoring component has about 2.5% of the total 
probability mass, while for girls it is about 0.1%. 
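
The score above which the higher component becomes the more probable source can be sketched as follows, assuming normal components with a common standard deviation. The crossover expression is obtained by equating the two weighted component densities; it is an illustration, not a formula quoted from this book. The function names and parameter values are hypothetical and will not reproduce the U.S. PISA threshold cited above.

```python
import math
from statistics import NormalDist


def posterior_higher(y, w, mu1, mu2, sigma):
    """Probability that score y arose from the higher component (weight w)."""
    low = (1 - w) * NormalDist(mu1, sigma).pdf(y)
    high = w * NormalDist(mu2, sigma).pdf(y)
    return high / (low + high)


def crossover_score(w, mu1, mu2, sigma):
    """Score at which both components are equally probable sources."""
    return (mu1 + mu2) / 2 + sigma ** 2 * math.log((1 - w) / w) / (mu2 - mu1)


# Hypothetical values for illustration only.
q, mu1, mu2, sigma = 0.025, 480.0, 650.0, 85.0
boys_cut = crossover_score(q, mu1, mu2, sigma)        # boys: weight q
girls_cut = crossover_score(q ** 2, mu1, mu2, sigma)  # girls: weight q^2

print(f"boys crossover:  {boys_cut:.1f}")
print(f"girls crossover: {girls_cut:.1f}")
```

Because the weight on the higher component is far smaller for girls, their crossover score lies much further out in the right tail, which is why virtually all high scoring girls are attributed to the lower component.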

To maintain the view that, under Y, only those boys and girls in the higher scoring 
components are high math achievers, which a casual reader might believe, is 
incorrect. An interesting question to ponder, however, is whether two individuals 
with the same high math score, but with scores that arise from different latent 
component distributions, are equally talented. 

Y has nearly 40-year-old conceptual roots [104]. There is a historical commentary 
about this early work and its corresponding reception [130]. The following 12 
sections address additional model applications, add some further analytical results, 
and consider issues which have been alluded to earlier. 


6.2 Math Meta-analyses and Variance Ratios 


The puzzle of why small $d > 0$ and small average $d$, or $\bar{d} > 0$, are commonly 
observed in math meta-analyses was noted at the outset in Chap. 1. No conceptual 
interpretation of these findings has appeared. However, from the perspective of Y, 
a conceptual explanation emerges. 

To see how $d$ is viewed under Y, replace the sample values in the expression for 
$d$, given at the outset in Chap. 1, with their Y parameter values, given in Chap. 4. 
For example, replace $\bar{x}_b$ with $\mu_b = (1 - q)\mu_1 + q\mu_2$ and $s_b^2$ with 
$\sigma_b^2 = q(1 - q)(\mu_2 - \mu_1)^2 + \sigma^2$. After some algebra, obtain 

$$\eta = \frac{q(1 - q)}{\sqrt{1/\xi^2 + q(1 - q^3)/2}},$$

with $\xi = (\mu_2 - \mu_1)/\sigma$. Thus, $d$ under Y estimates $\eta$, not $\delta$, and $\eta$ provides the 
conceptual model for $d$ under Y. 

Although $\eta$ has no useful math-substantive interpretation, inspection of $\eta$ reveals 
why small $d > 0$ appear. First, $\eta > 0$, so $d > 0$ are expected. Second, $\eta$ will be 
small when $q$ is small, and estimates $\hat{q}$ suggest $q$ for math is small, at least for 
U.S. math test results. Hence, $d > 0$ should be small. As $q \to 0$, $\eta \to 0$ and 
similarly for $d$. $\eta$ increases with increasing component mean difference $\mu_2 - \mu_1$. $\eta$ 
increases with decreasing $\sigma$ and correspondingly increasing $\xi$. $\eta \in (0, 0.556)$ and 
approaches its maximum as $\xi \to \infty$ with $q \approx 0.37$. 

For illustration, consider Example 4. $d = 0.067$; replacing the parameters in $\eta$ 
with estimates gives $\hat{\eta} = 0.067$ with $\hat{q} = 0.015$. Of course, $d$ will not always be 
positive, but in expectation under Y it should be. Thus, Y provides an explanation 
revealing why small $d > 0$ appear and in particular small $d > 0$ in math 
meta-analyses. 
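
The stated bound on the model effect size can be checked numerically. The sketch below codes the expression q(1 - q) / sqrt(1/xi^2 + q(1 - q^3)/2), with xi the component mean difference in standard deviation units, together with its limiting form as xi grows without bound. The limiting form is a simplification worked out for this illustration and should be read as an assumption; the function names and the grid search are likewise only illustrative.

```python
import math


def eta(q, xi):
    """Model effect size for given q and xi = (mu2 - mu1) / sigma."""
    return q * (1 - q) / math.sqrt(1 / xi ** 2 + q * (1 - q ** 3) / 2)


def eta_limit(q):
    """Limit of eta as xi grows without bound (simplified algebraically)."""
    return math.sqrt(2 * q * (1 - q) / (1 + q + q * q))


# Coarse grid search over q for the largest limiting value.
grid = [i / 1000 for i in range(1, 1000)]
q_star = max(grid, key=eta_limit)

print(f"q at maximum: {q_star:.3f}")
print(f"upper bound:  {eta_limit(q_star):.3f}")
```

The search places the maximum near q = 0.37 with a bound near 0.556, in agreement with the values quoted in the text. For small q and moderate xi, eta is far below that bound.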

Variance ratios $s_b^2 / s_g^2$ greater than one have also been a puzzle. As with the 
above, replace the sample quantities with the quantities which they estimate under 
Y, and then form the corresponding ratio. $s_b^2 / s_g^2$ estimates the ratio $\sigma_b^2 / \sigma_g^2$, which 
equals 

$$\frac{q(1 - q)(\mu_2 - \mu_1)^2 + \sigma^2}{q^2(1 - q^2)(\mu_2 - \mu_1)^2 + \sigma^2}.$$

Inspection reveals why the corresponding sample ratios are larger than one. That is 
because $q(1 - q) > q^2(1 - q^2)$ when $0 < q < 0.618$. Estimates of $q$ have been well 
less than 0.618 for both reading and math. For Example 4, $s_b^2 / s_g^2 = 1.306$, while 
the model expression, just above, with estimates replacing parameters gives 1.341. 
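
A brief sketch makes the 0.618 threshold concrete. Here the variance ratio is written in terms of xi = (mu2 - mu1)/sigma by dividing numerator and denominator by sigma^2, a convenience adopted for this illustration. The function name and the values of q and xi are hypothetical.

```python
import math


def variance_ratio(q, xi):
    """Boys-to-girls model variance ratio, with xi = (mu2 - mu1) / sigma."""
    return (q * (1 - q) * xi ** 2 + 1) / (q ** 2 * (1 - q ** 2) * xi ** 2 + 1)


# The threshold solves q + q^2 = 1, i.e. q = (sqrt(5) - 1) / 2, about 0.618.
threshold = (math.sqrt(5) - 1) / 2

for q in (0.1, 0.3, 0.6, threshold, 0.7):
    print(f"q={q:.3f}  ratio={variance_ratio(q, 3.0):.4f}")
```

The ratio exceeds one for q below the threshold, equals one at the threshold, and falls below one above it, whatever positive value xi takes.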


6.3 Arguments Against Genetic Influences 


It is difficult to find explicit statements as to why sex differences, especially in math, 
are not to be found in matters genetic. But consider these quotes: 


For some countries, girls do as well or better than boys at the left tail, but worse at the right 
tail (United States, Hungary). In other countries, sex differences are most pronounced in the 
middle of the distribution (Russia, Austria). It is hard to come up with a compelling genetic 
explanation for such diversity! [5, p. 164]. 


If the genetic contribution were strong, however, then males should predominate at the 
upper tail of performance in all countries and at all times, and the male-female ratio should 
be of comparable size across different samples [131, p. 956]. 


...there is no reason to believe that the genetic factors involved in determining gender will 
vary across countries, implying that to the degree that gender differences in mathematics 
result from genetic factors, there should be no international variation in these differences 
[132, p. S140]. 


...the gender gap in math differs substantially across countries. Hence, ‘nature’ cannot 
be the only account for the females’ disadvantage in math; there must be alternative 
explanations. . . [61, p. 2]. 


A puzzle is that these quotes all seem to imply the belief, seemingly widely 
held, that genetic influences, if they occurred, should resemble a constant 
effect, invariant over countries, as if genes, their relative frequencies, and their 
estimates did not vary. One need only consider human height, a phenotypically 
genetically influenced sex difference variable, which varies in countries around the 
world both in the height of men and women and in the sizes of their mean differences 
[133]. 

Gene frequencies are known to often vary widely over different populations 
and countries due to migration factors, geography, climate, altitude, and mutations 
among other factors [134-137]. As the previous chapter's examples, analyses, 
and figures reveal, the relative size of $\hat{q}$ for the PISA data varies widely over 
different countries, both for math with $0.001 < \hat{q}_m < 0.476$ and for reading with 
$0.197 < \hat{q}_r < 0.614$. 

As observed earlier, part of this variation seems plausibly attributed to geography 
and country history. Recall that the Czech Republic and Slovakia have 
among the largest PISA math $q_m$ estimates, respectively, $\hat{q}_m = 0.302, 0.351$, from 
Table 5.3. These countries share a border and were earlier a single country. Norway 
and Sweden are another pair with a common border; their $\hat{q}$ for both reading 
and math will be considered below within another context. However, note here, 
$\hat{q}_m = 0.026, 0.052$ for Norway and Sweden, respectively. Within each pair of 
these countries, the $\hat{q}_m$ seem similar, but between the pairs of countries the $\hat{q}_m$ are 
remarkably different. 

It had been thought Europe was largely homogeneous in genetic variation. Recent 
palaeogenetical evidence clearly falsifies this belief and makes the variation in 
estimates of PISA q, over different countries, even more conceptually plausible. 
There were two genetically distinct groups in Europe around 30,000 years ago: one 
group living in France and Spain and the other group living in what is now The 
Czech Republic and in Italy [118, 119]. The PISA estimates are $\hat{q}_m = 0.050, 0.056$ for 
France and Spain, respectively. The Italian PISA estimate $\hat{q}_m = 0.144$, plus two 
additional Italian estimates $\hat{q}_m = 0.178\,(0.023), 0.237\,(0.028)$, with standard errors 
in parentheses, from [60], appear to diverge from the Czech Republic 
PISA estimate $\hat{q}_m = 0.302$. But collectively the Czech Republic PISA and Italian 
$\hat{q}_m$ estimates are multiples of the France and Spain PISA $\hat{q}_m$. The $\hat{q}_r$ for 
these four countries seem more similar than do their $\hat{q}_m$, as Table 5.3 reveals. For 
France and Spain, respectively, $\hat{q}_r = 0.470, 0.464$. For the Czech Republic and 
Italy, $\hat{q}_r = 0.568, 0.417$, respectively. 

An additional part of the variation in these parameter estimates, as already noted, 
is likely due to different test translations, which may advantage boys and 
girls differently, depending on the language and culture. There is of course sampling 
variation in estimates, as noted earlier. What is nearly constant over countries are the 
inequalities in S. 


6.4 The Search for Biological Genetical Evidence 


While Y models explicitly X-linked influences, other influences if they are plausibly 
viewed as being additive effects, both genetical and otherwise, can be represented 
through the location parameters $\mu_1$ and $\mu_2$. $N$ can absorb variance changes. 
Consider the 1990 and 2017 eighth grade math means in Table 2.1. Both sexes 
increased 20 points over a 27-year interval. Component means $\hat{\mu}_1 = 262$ and 282 
and $\hat{\mu}_2 = 407$ and 439, for 1990 and 2017, respectively, reflect these mean changes. 
The other parameter estimates remained similar. For example, $\hat{q} = 0.007$ and 0.006 
for 1990 and 2017, respectively. 

Genome wide association studies (GWAS) have become the most widely used 
approach for mapping genotype to phenotype associations [138]. These approaches, 
which may be viewed as a kind of data mining, can have serious difficulties because 
of linkage disequilibrium [139]. 

The core notion is that phenotypes are the results of thousands of minuscule 
genetical elements, SNPs (single-nucleotide polymorphisms), the outcomes of 
which additively combine, in a Mendelian manner, and are modelled as sums of 
random variables and are called polygenic scores. These scores are obtainable 
for individuals. Using polygenic scores for variables which may be considered 
proxies for an intelligence score, such as educational attainment, it has been 
claimed “... will bring the omnipotent variable of intelligence to all areas of the life 
sciences without the need to assess intelligence [140, p. 157].” For a very different 
perspective on this possibility, see Charney [141]. 

If the goal of understanding math and reading test score sex differences in 
task performance is seen as equivalent to the goal of coherently accounting for the 
inequalities of S, which is the view here, then GWAS approaches, as currently 
constituted, appear unable to address the matter. There are a number of reasons for 
this conclusion. One is that GWAS requires huge sample sizes in the thousands 
or hundreds of thousands. Such a database for math and reading testing simply 
does not exist. More fundamental, however, is that the target traits for GWAS 
are known observable phenotypes. In Y the two phenotypes are unobserved latent 
distributions. GWAS approaches cannot address latent processes. To add a latent 
processes layer to the already complex GWAS framework would invariably increase 
outcome uncertainty which would seem to suggest the need for sample sizes perhaps 
unattainable for any variable of interest. 

As stated before, at least for math, X-linked heritability is generally very small. 
So there is substantial variance unmodelled and unexplained. As just noted, the 
variable $N$, or $N_b$ for boys and $N_g$ for girls, reflects these contributions, some of 
which doubtlessly are polygene effects. There is no inconsistency here. A trait can 
have a major Mendelian influence as well as polygene influences. An unresolved 
issue in biology is where, along the continuous spectrum ranging from Mendelian 
genetics to complex polygene traits, particular phenotypes reside [142]. Whether 
current technology allows the biological identification of genes with small relative 
frequencies implied by Y seems unclear. 

The wide acceptance of genetical theory has been achieved historically through 
conceptual arguments and how these arguments "fit" with phenotypical data. It is 
difficult to find fault with Y on these grounds. 


6.5 q as the Realization of a Random Variable Q 


Because each individual V has been viewed throughout as the unit of analysis, $q$ 
may be viewed as random over different V. As Table 5.3 makes clear, $\hat{q}$ certainly 
varies widely, at the country level, for both reading and math. Within countries, there 
are doubtlessly subpopulations reflecting population flows over the ages, as well 
as the known stochastic behavior of genes [143]. Assuming $q$ is a fixed unknown 
constant for any V is, as features of all models are, an approximation to reality. More 
realistically, $\hat{q}$ estimates a mean value of different values of $q$. This within V randomness, 
while not explicitly modeled, does not appear to jeopardize the core features of the 
model. The following argument hopefully makes this clear. 

Let the random variable $Q$ have realizations $q$ with continuous or discrete 
distribution function $G(q)$. Consider math for boys. Clearly, $P(B_b = \mu_2 \mid q) = q$. 
Then, 

$$P(B_b = \mu_2) = \int P(B_b = \mu_2 \mid q)\, dG(q) = \int q\, dG(q) = E(Q).$$


So, estimates of q may be viewed as estimates of the mean of Q and similarly for 
girls. 
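
The argument can be illustrated with a small two-stage simulation, offered only as a sketch. A value of q is first drawn from a discrete mixing distribution, and a boy is then assigned to the higher component with that probability. The three subpopulation values of q, their shares, the function names, and the seed are all hypothetical.

```python
import random


def mean_of_q(support, probs):
    """E(Q) for a discrete mixing distribution G."""
    return sum(q * g for q, g in zip(support, probs))


def simulated_rate(support, probs, n, seed=0):
    """Share of n simulated boys falling in the higher component."""
    rng = random.Random(seed)
    draws = rng.choices(support, weights=probs, k=n)  # stage 1: draw q from G
    return sum(rng.random() < q for q in draws) / n   # stage 2: Bernoulli(q)


# Hypothetical subpopulation values of q and their population shares.
support = [0.01, 0.03, 0.08]
probs = [0.5, 0.3, 0.2]

expected = mean_of_q(support, probs)
rate = simulated_rate(support, probs, n=200_000)

print(f"E(Q) = {expected:.4f}, simulated rate = {rate:.4f}")
```

The simulated rate settles near E(Q), so an estimate of q obtained from pooled data behaves as an estimate of the mean of Q.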


6.6 Sex Differences in Distributional Tails 


As has been noted above, a widely recognized empirical marker for sex differences 
particularly in math is observed differences in the test score distributional tails, with 
boys having larger math right tails than girls [11, 12, 144]. Often proxies for these 
differences, such as effect size or variance ratios, are of focus. The comparison of 
reading and math tail areas, both upper tail and lower tail, has also been observed: 
“The sex difference in mathematics was non-existent in the lower end of the 
performance distribution, but the sex difference in reading at the lower end was 
at its peak [122, p. 4].” The matter of right tail inequality is particularly concerning 
to Ceci and Williams [5]. They make reference to the “right tail" more than fifty 
times in their book. 

That Y produces graphs which portray these tail area differences both for reading 
and for math is clear from the many figures displayed above. What is noted here is 
that with an additional distributional assumption, these differences follow from Y. 

Define 


$$r(s) = \frac{\int_s^\infty f_b(y)\,dy}{\int_s^\infty f_g(y)\,dy}, \quad -\infty < s < \infty.$$


If the component distributions of $f_b(y)$ and $f_g(y)$ are assumed to be normal, which 
is the assumption in the graphic displays when components appear, then $r(s)$ is 
strictly increasing as $s$ increases (for the argument, see [104]). That is, the ratio 
of the upper tail areas not only favors boys over girls, but the ratio $r(s)$ will also 
increase as $s$, the smallest test score, increases. American Mathematics Competition 
data [145] are consistent with the theory. A similar result holds for reading. 
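
The monotone behavior of the tail ratio can be seen in a short numerical sketch. It assumes normal components with a common standard deviation, weight q on the higher component for boys, and weight q^2 for girls. The function names, parameter values, and cutoff scores are hypothetical.

```python
from statistics import NormalDist


def upper_tail(s, w, mu1, mu2, sigma):
    """P(Y > s) for a two-component normal mixture with weight w on mu2."""
    low = 1 - NormalDist(mu1, sigma).cdf(s)
    high = 1 - NormalDist(mu2, sigma).cdf(s)
    return (1 - w) * low + w * high


def tail_ratio(s, q, mu1, mu2, sigma):
    """Ratio of boys' to girls' upper tail areas above score s."""
    boys = upper_tail(s, q, mu1, mu2, sigma)
    girls = upper_tail(s, q ** 2, mu1, mu2, sigma)
    return boys / girls


# Hypothetical values for illustration only.
q, mu1, mu2, sigma = 0.05, 500.0, 650.0, 90.0
cutoffs = list(range(300, 801, 50))
ratios = [tail_ratio(s, q, mu1, mu2, sigma) for s in cutoffs]

for s, r in zip(cutoffs, ratios):
    print(f"s={s}  r(s)={r:.3f}")
```

Across the cutoffs shown, the ratio starts just above one and grows steadily as the cutoff moves to the right.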


6.7 Two Additional Alternatives for d 


In addition to the Hellinger distance, two other possible replacements for effect size 
d are suggested here. Neither requires raw data. Their downsides are that they require 
specification of boys' and girls' test score probability distributions and computation 
with software. In the examples below, the component distributions are normal. 
The spirit of this approach is to ask by what proportion one sex beats the other in test 
scores. Consider the probability $P(Y_b < Y_g)$, the proportion of girls' test scores 
which are greater than the boys' test scores. It is perhaps intuitively clear that if 
the random variables $Y_b$ and $Y_g$ shared the same continuous test score distribution, 
$P(Y_b < Y_g) = 1/2$. Then the departure of an observed proportion, $\hat{P}(Y_b < Y_g)$, 
from one-half expresses the separation of boys from girls. One might think pairs of 
boys' and girls' test scores would be required to assess the matter. However, this is 
not necessary as there is a general expression which is 

$$P(Y_b < Y_g) = \int_{-\infty}^{\infty} F_b(s)\, f_g(s)\, ds,$$

where $s$ is a test score, $F_b(s)$ is the lower tail cumulative probability distribution 
for boys, and $f_g(s)$ is the probability distribution (density) for girls. Both appear in 
Chap. 4. While the focus is on the distributions under Y, the integral equation holds 
for all continuous probability distributions. Estimates $\hat{F}_b(s)$ and $\hat{f}_g(s)$ are easily 
obtained by replacing their parameters with their estimates. Then a line or two of R 
code, using numerical integration, returns $\hat{P}(Y_b < Y_g)$. 

Using estimates obtained from the V associated with the U.S. 2003 PISA reading 
scores in Table 2.6 results in $\hat{P}(Y_b < Y_g) = 0.590$. Girls beat boys here. 
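
The numerical integration can be sketched in a few lines. The text refers to R; the version below uses Python with a plain trapezoid rule, which is a substitute chosen for this illustration and not the computation behind the value reported above. Normal mixture components with a common standard deviation are assumed, and the function names and parameter values are hypothetical math-type settings (weight q for boys, q^2 for girls).

```python
from statistics import NormalDist


def make_mixture(w, mu1, mu2, sigma):
    """Return (pdf, cdf) of a two-component normal mixture, weight w on mu2."""
    low, high = NormalDist(mu1, sigma), NormalDist(mu2, sigma)

    def pdf(y):
        return (1 - w) * low.pdf(y) + w * high.pdf(y)

    def cdf(y):
        return (1 - w) * low.cdf(y) + w * high.cdf(y)

    return pdf, cdf


def prob_first_below_second(cdf_first, pdf_second, lo, hi, n=4000):
    """Trapezoid approximation of the integral of F_first(s) f_second(s) ds."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        s = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * cdf_first(s) * pdf_second(s)
    return total * h


# Hypothetical values for illustration only.
q, mu1, mu2, sigma = 0.05, 500.0, 650.0, 90.0
lo, hi = mu1 - 8 * sigma, mu2 + 8 * sigma
pdf_b, cdf_b = make_mixture(q, mu1, mu2, sigma)
pdf_g, cdf_g = make_mixture(q ** 2, mu1, mu2, sigma)

p_same = prob_first_below_second(cdf_b, pdf_b, lo, hi)  # identical distributions
p_bg = prob_first_below_second(cdf_b, pdf_g, lo, hi)    # boys versus girls

print(f"identical distributions: {p_same:.4f}")
print(f"P(Y_b < Y_g):            {p_bg:.4f}")
```

With identical distributions the integral returns one-half. With these math-type settings it falls slightly below one-half, the opposite direction from the reading result.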


6.7.2 The Overlap Coefficient OVL 


OVL computes the amount of overlap between two distributions [146]. There are 
two versions: one for discrete distributions and one for continuous distributions. 
For discrete distributions, 

$$\mathrm{OVL} = \sum_y \min[f_1(y), f_2(y)],$$

which just sums the minimum height at each value of $y$. In a similar way for 
continuous data, 

$$\mathrm{OVL} = \int \min[f_b(y), f_g(y)]\, dy.$$

It is clear from these two definitions that if the distributions of boys and girls 
coincide, then OVL = 1, and if they share no support, that is, their distributions are 
disjoint, then OVL = 0. 
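
A minimal sketch of the continuous version follows, using a trapezoid rule over a finite interval. For simplicity it is shown with two single normal densities instead of the mixture densities of Y; any pair of density functions could be passed instead. The helper name `ovl` and the integration limits are assumptions of this sketch. The closed-form `overlap` method of Python's `statistics.NormalDist` serves as a cross-check.

```python
from statistics import NormalDist


def ovl(pdf1, pdf2, lo, hi, n=4000):
    """Trapezoid approximation of the integral of min(pdf1, pdf2) on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * min(pdf1(y), pdf2(y))
    return total * h


a = NormalDist(0.0, 1.0)
b = NormalDist(1.0, 1.0)

ovl_same = ovl(a.pdf, a.pdf, -10.0, 10.0)     # coinciding distributions
ovl_shifted = ovl(a.pdf, b.pdf, -10.0, 11.0)  # unit location shift
ovl_exact = a.overlap(b)                      # closed form for two normals

print(f"same: {ovl_same:.4f}  shifted: {ovl_shifted:.4f}  exact: {ovl_exact:.4f}")
```

Coinciding distributions give a value of one, and the numerical value for the shifted pair agrees with the closed form to several decimal places.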

OVL provides a useful way to portray how boys and girls are similar or different 
from each other with regard to their relative reading and math distributional overlap. 
Figures 6.1 and 6.2 display NAEP twelfth grade math and reading solutions. 

Fig. 6.1 NAEP 2019 Grade 12 math solution with V from Table 2.3. The unshaded area is the 
area of overlap 

Fig. 6.2 NAEP 2015 Grade 12 reading solution with V from Table 2.4. The unshaded area is the 
area of overlap 


Focus on $f_b(y)$, the bold solid line, and $f_g(y)$, the bold dashed line. The white 
areas represent the overlap, while the shaded areas represent their corresponding 
distributional departures. The figures make clear there is much greater similarity in 
U.S. NAEP math scores at Grade 12 than in the NAEP reading scores at Grade 12. 


For math, OVL = 0.971, while for reading OVL = 0.894. 


6.8 A PISA Reading and Math “Paradox” 


This section addresses what is claimed to be a paradox. Addressing it is unrelated 
to matters pertaining to Y, and thus it may be skipped if desired. But because the 
matter can be addressed with the data provided here and a resolution provided, the 
paradox is considered. 

Fig. 6.3 Plot of 41 countries' PISA reading mean differences, girls minus boys, against PISA math 
mean differences, boys minus girls. Data from Tables 2.5 and 2.6 

Figure 6.3 plots the difference scores, girls’ mean minus the boys’ mean in reading, 
against the boys’ mean minus the girls’ mean in math, for all 41 countries with 
PISA data from Tables 2.5 and 2.6; $r = -0.589$. Marks [147] first reported such a 
relationship using PISA data from 25 countries. Stoet and Geary [122] provide four 
plots, their Figure 2, representing PISA data spanning a decade, with axes defined 
as in Fig. 6.3, which show $-0.78 < r < -0.60$. They regarded these negative 
correlations as very troubling: “...a hitherto unexplained paradoxical finding: The 
smaller the sex differences in mathematics, the larger the sex differences in reading 
(i.e., countries with a smaller sex difference in mathematics have a larger sex 
differences in reading, and countries with larger sex differences in mathematics 
have a smaller sex difference in reading). This inverse relation between the sex 
differences in mathematics and reading achievement poses a critical challenge for 
educators and policy makers who might wish to eliminate such differences ... there 
are currently no countries that have successfully eliminated both the sex difference 
in mathematics. . . and the sex difference in reading. . . [122, p. 2, italics in original].” 

In fact, the negative relationship can be the consequence of a simple obvious 
observation: Math skills are not required for reading test performance, but certainly 
some reading skills are necessary for some math test items. Girls hugely dominate 
boys’ performance in reading. It has been noted [148] that girls’ superior reading 
skill can lead them to opportunities where this skill is particularly advantageous. 
But it may not have been recognized that girls’ reading skills would likely be 
advantageous as well for girls’ understanding of at least some math test items, 
and thus their math and reading scores would be expected to be correlated. This 
correlation explains the paradox. 

Assume that girls’ reading and math scores are positively correlated. Then, 
negative correlations, as observed in, e.g., Fig. 6.3, fall out immediately as an 
expected empirical consequence. This result may seem counterintuitive. 

To show this, define random variables $B_{bm}$, $B_{gm}$, $B_{br}$, and $B_{gr}$, where subscripts 
denote b for boys, g for girls, m for math, and r for reading. Thus, $B_{gr}$ is a girl's 
reading score random variable. Let $\rho(\cdot)$ denote the correlation between pairs of 
random variables. Assume $\rho(B_{gm}, B_{gr}) > 0$, which is equivalent to the covariance 
$\mathrm{cov}(B_{gm}, B_{gr}) > 0$. Other pairs of random variables are assumed independent of 
each other and thus are zero correlated. 
Next, define 

$$M = B_{bm} - B_{gm}$$

and 

$$R = B_{gr} - B_{br}.$$


This pair of expressions is the probability model to evaluate. It is the sign of the 
correlation between $M$ and $R$, that is, $\rho(M, R)$, which is of concern. Because 
the sign of the covariance dictates the sign of the correlation, it is sufficient to 
consider the covariances, and without loss of generality, it may be assumed all 
random variables have zero mean. 

$$\mathrm{cov}(M, R) = \mathrm{cov}(-B_{gm}, B_{gr}) = -\mathrm{cov}(B_{gm}, B_{gr}) \;\Rightarrow\; \rho(M, R) < 0.$$

If for boys $\rho(B_{bm}, B_{br})$ were also assumed to be positive, as seems plausible, 
the same result would hold but drive $\rho(M, R)$ more negative. Thus, assuming that 
girls’ math and reading scores are positively correlated is a sufficient condition to 
produce the negative M and R difference score correlation, and consequently the 
resulting empirical correlations, the negative rs, are simply realizations of what is 
expected. 
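
The covariance argument can be checked with a small calculation. It assumes, as in the text, that only the within-sex math and reading covariances are nonzero. The function name, the optional `scale` argument (a multiplier applied to M), and the unit variances are hypothetical choices made for this sketch.

```python
import math


def corr_m_r(var_bm, var_gm, var_br, var_gr, cov_g, cov_b=0.0, scale=1.0):
    """Correlation of scale * M and R, with M = B_bm - B_gm and R = B_gr - B_br.

    cov_g and cov_b are the girls' and boys' math-reading covariances; all
    other pairs of variables are taken to be uncorrelated.
    """
    cov_mr = -scale * (cov_g + cov_b)
    sd_m = scale * math.sqrt(var_bm + var_gm)
    sd_r = math.sqrt(var_br + var_gr)
    return cov_mr / (sd_m * sd_r)


girls_only = corr_m_r(1.0, 1.0, 1.0, 1.0, cov_g=0.5)
both_sexes = corr_m_r(1.0, 1.0, 1.0, 1.0, cov_g=0.5, cov_b=0.5)
halved_gap = corr_m_r(1.0, 1.0, 1.0, 1.0, cov_g=0.5, scale=0.5)

print(girls_only, both_sexes, halved_gap)
```

With unit variances the girls-only case gives -0.25, adding the same covariance for boys gives -0.5, and rescaling M leaves the correlation where it was.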

Marks interpreted the correlation as having a causal thrust: “Policies that promote 
girls’ educational performance decrease the gender gap in mathematics but also 
increase the gap in reading [147, p. 105]." If true, this conclusion is disturbing. 
Suppose a policy existed that could reduce the math gap by one-half. Then 
$\tfrac{1}{2}M = \tfrac{1}{2}(B_{bm} - B_{gm})$, which would shrink the expected mean difference in math by 
one-half. Under the model above, such a change has no influence on $R$, the reading gap, 
and in particular $\rho(\tfrac{1}{2}M, R) = \rho(M, R)$, leaving the correlation unchanged. Stoet 
and Geary [122] acknowledge they have no explanation for the paradox and claim 
further study is required. Given the above analysis, further study may not be needed. 


6.9 Can mdo and rdo Be “Chance” Occurrences? 


Could satisfying rdo or mdo simply be random chance events? The following 
focusses on math and mdo; the changes necessary for reading rdo should be 
apparent. Two very different answers must be given. As a first answer, assume 
boys’ and girls’ test scores are all independent and identically distributed, and 
based on random samples, and that their shared distribution is continuous. Then, 
$P(\bar{X}_b > \bar{X}_g) = P(S_b^2 > S_g^2) = 1/2$. Assuming in addition boys’ and girls’ test 
distributions are normal, then $P(\bar{X}_b > \bar{X}_g \cap S_b^2 > S_g^2) = 1/4$. The probability to 
consider if mdo is of focus is 

$$p_m = P(\bar{X}_b > \bar{X}_g \cap S_b^2 > S_g^2 \cap S_b^2 - S_g^2 > \bar{X}_b - \bar{X}_g).$$


Under normality, $p_m < 1/4$, so 1/4 is perhaps a crude upper bound. Otherwise, 
evaluating $p_m$ would appear to require approximation by simulation. As an example, 
if test scores for both boys and girls followed a $t$ distribution on 15 df, 
$p_m \approx 0.17$ with a sample size of 100 of each sex, a probability that appears 
roughly independent of sample size. Should $p_m \approx 1/4$, then a given V satisfying 
mdo could easily occur with sampling variation. However, given $k$ independent V, 
the probability to consider is $1/4^k$, which becomes vanishingly small as $k$ grows. 
Consequently, a sampling variation argument is implausible given the large body of 
data reviewed above in Chap. 2. 
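
A simulation of the kind alluded to can be sketched as follows. It assumes standard normal scores for both sexes, unlike the t distribution example above, so the estimate need not match 0.17. Note that the third inequality compares a variance difference with a mean difference, so the result depends on the score scale adopted. The function names, number of replications, and seed are hypothetical.

```python
import random


def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, v


def estimate_pm(n=100, reps=2000, seed=1):
    """Monte Carlo estimate of p_m for i.i.d. standard normal scores."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        mb, vb = mean_var([rng.gauss(0.0, 1.0) for _ in range(n)])
        mg, vg = mean_var([rng.gauss(0.0, 1.0) for _ in range(n)])
        # All three inequalities of the event must hold together.
        if mb > mg and vb > vg and vb - vg > mb - mg:
            hits += 1
    return hits / reps


p_hat = estimate_pm()
print(f"estimated p_m: {p_hat:.3f}")

# Using 1/4 as the crude bound, chance agreement across k independent V:
for k in (1, 5, 10):
    print(f"k={k}: {0.25 ** k:.2e}")
```

Under these assumptions the estimate falls below the 1/4 bound, and the bound raised to the power k shrinks rapidly as k grows.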

This first answer just given applies to settings where it can be reasonably assumed 
that the elements of V were obtained from a random sample. For none of the large- 
scale national or international surveys, is this assumption appropriate, for reasons 
noted earlier, because the sampling and estimation procedures are far removed from 
a random sampling model. Thus, a second answer is required, but a definitive answer 
cannot be given. In the case of NAEP data, the consistency with which rdo and mdo 
hold over decades and with the knowledge that very large sample sizes are reflected 
in each sample estimated quantity will have to suffice. There are 66 NAEP V in 
Tables 2.1, 2.2, 2.3, and 2.4 for both reading and math. Among them, 58 satisfy 
either mdo or rdo. Assuming the V are independent of each other and assuming 
1/4 would satisfy mdo or rdo by chance, the expected number is about 17. In 
the end, it is the pervasiveness of the empirical inequalities holding for children 
of various ages, over intervals spanning decades, in studies of varying size, globally 
and historically, for both reading and math that appears compelling and important. 
These empirical facts provide the ultimate justification for implementing Y. 
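
The arithmetic behind the NAEP comparison can be sketched under the same two assumptions stated above: independent V and a chance rate of 1/4. The binomial tail probability is an addition made for this illustration and is not reported in the text; the function name is hypothetical.

```python
from math import comb


def binom_upper_tail(n, k, p):
    """P(X >= k) for X distributed Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


n_sets, n_satisfying, chance = 66, 58, 0.25

expected_count = n_sets * chance
tail_prob = binom_upper_tail(n_sets, n_satisfying, chance)

print(f"expected by chance: {expected_count}")
print(f"P(at least {n_satisfying} of {n_sets}): {tail_prob:.2e}")
```

The expected count is 16.5, or about 17, and the probability of 58 or more by chance alone is vanishingly small under these assumptions.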


6.10 Model Y Dimensions and Fitting 


The data of focus are two independent pairs of V, $(\bar{x}_b, s_b)$ and $(\bar{x}_g, s_g)$, or four data 
points, while Y has four parameters, $q$, $\mu_1$, $\mu_2$, and $\sigma$. At first glance, one might 
think that the estimation and fitting procedures involving fitting four parameters to 
four data points result in a saturated model and consequently one of little use. However, 
an examination of the model and estimation procedures shows that Y is a much more 
tightly constrained model than might first be thought, as has been stated earlier. 
True, four parameters have been estimated in all the examples above. But as 
observed earlier, only three parameters are required to estimate the critical quantities 
and uniquely fix the graphical display properties: $q$, $\mu_2 - \mu_1$, and $\sigma$. Estimating 
$\mu_2$ and $\mu_1$ fixes the abscissa location of the display. Estimating only $\mu_2 - \mu_1$ 
changes nothing substantively in any of the above discussion. The graphs would 
then be unique within an abscissa translation. Furthermore, Y is a much more 
constrained model than counting the number of parameters reveals. Considering 
math, Y has three inequality constraints expressed in inequalities (4.2), (4.3), 
and (4.4) of Chap. 4, which correspond to the empirical inequalities of mo and 
mdo. In addition, an examination of these inequalities and their analytical arguments 
shows they are independent of $\sigma$, and thus the inequalities are forced by functions 
of just two parameters, $\mu_2 - \mu_1$ and $q$. Said in another way, the inequalities are 
forced by properties of $B_b$ and $B_g$, while $N_b$ and $N_g$ play no role in generating Y’s 
inequalities. 

Fig. 6.4 Parameter space associated with Y inequalities. The rectangular space $0 < q < 0.618$ 
and $\mu_2 - \mu_1 > 0$, otherwise unbounded, defines the space where inequalities (4.2) and (4.3) of 
Chap. 4 hold. The curved line denotes the space for inequality (4.4) to hold. These two parameters 
$\mu_2 - \mu_1 > 0$ and $q$ generate the model inequalities 

From a geometrical perspective, this means only a two-dimensional space is 
required to capture the parameter constraints, with the two parameters $q$ and $\mu_2 - \mu_1$ 
viewed as variables in this space. The space required to capture the model equivalent 
of mo is the rectangular region shown in Fig. 6.4. It is open on the right side: the 
upper bound $\mu_2 - \mu_1 < 50$ in the graph is arbitrary, and to capture mo only requires 
that $0 < q < 0.618$ and $\mu_1 < \mu_2$, or $0 < \mu_2 - \mu_1$. 

To satisfy the model constraints corresponding to mdo, further constraints must 
be imposed: $\mu_2 - \mu_1 > 1$ and $q < [\sqrt{5 - 4/(\mu_2 - \mu_1)} - 1]/2 < 0.618$. The area 
below the curved line in Fig. 6.4 shows the corresponding model parameter space. 
Thus, Fig. 6.4 makes clear the analytical inequality properties of Y are determined 
by just two parameters. From a very different perspective, these results show just 
how readily falsifiable Y is. While the focus here has been on math, a similar 
development follows easily for reading. 
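
The two regions of Fig. 6.4 are simple enough to check by hand or in code. The following R sketch encodes the constraints as stated in this section and in (A.6) and (A.7); the function names are hypothetical, and the final helper compares the model's variance gap with its mean gap directly, as an independent check on the closed-form bound.

```r
golden <- (sqrt(5) - 1) / 2   # about 0.618

# TRUE when (q, mu_d) lies in the rectangle that yields the mo inequalities
in_mo_region <- function(q, mu_d) {
  q > 0 & q < golden & mu_d > 0
}

# Upper bound on q for the mdo inequality; defined only for mu_d > 1
q_upper_mdo <- function(mu_d) {
  ifelse(mu_d > 1, (sqrt(pmax(5 - 4 / mu_d, 0)) - 1) / 2, NA_real_)
}

# TRUE when (q, mu_d) lies below the curved line of Fig. 6.4
in_mdo_region <- function(q, mu_d) {
  bound <- q_upper_mdo(mu_d)
  mu_d > 1 & q > 0 & !is.na(bound) & q < bound
}

# Direct check: does the model variance gap exceed the model mean gap?
var_gap_exceeds_mean_gap <- function(q, mu_d) {
  var_gap  <- mu_d^2 * q * (1 - q) * (1 - q * (1 + q))
  mean_gap <- mu_d * q * (1 - q)
  var_gap > mean_gap
}
```

For instance, with the hypothetical values q = 0.3 and $\mu_2 - \mu_1 = 40$ both regions contain the point, whereas with q = 0.3 and $\mu_2 - \mu_1 = 1.5$ the point satisfies the mo constraints but falls above the curved line.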


6.11 Y and the Global Gender Gap Index 


Yearly, the World Economic Forum releases an index of gender equality for each of 
about 150 different countries. Called the Global Gender Gap Index, or here GGGI, 
it is a scalar index bounded between zero and one, with one presumably denoting equality. It 
is a composite of four assessed domains for equality: economic participation and 
opportunities, educational attainment, health and survival, and political empowerment. 
GGGI is billed as a measure of "gender equality," which "assesses countries 
on how well they are dividing their resources and opportunities among their male 
and female populations, regardless of the overall levels of these resources and 
opportunities [149]." Any departure of the GGGI from one by design indexes only 
women's disadvantage. The measure cannot reflect any disadvantage for boys or 
men. Intuitively, this fact seems to violate the very premise of a symmetrical equality 
relation which the idea of "gender equality" seems to imply. The GGGI values for 
2022 for the OECD PISA countries are given in Table 5.4, column four [150]. These 
values generally change modestly from year to year. 

It is interesting to consider the range of GGGI values from the perspective of Y. 
The assumption in doing so is that country-wide indices of sex differences in testing 
for both math and reading, as indexed by PISA scores, have wider implications. 
Although assessing children with tests is usually motivated by efforts to assess 
educational progress or by selection purposes, it is arguably the case that tests 
might be taken as a proxy index portending, in ways perhaps difficult to model 
or quantify, sex differences in behaviors reflected in various domains of activity 
long after the tests are taken, perhaps indicating individual life trajectories in a 
country. Furthermore, because girls show some marginal disadvantage in math 
scores in almost all OECD countries, such influences should modestly suppress 
GGGI scores. That is because they would likely increase disparities in domains 
that would contribute to that country's index, thus lowering the GGGI value for 
the country. However, girls show a huge advantage in all OECD countries on PISA 
reading tests, and thus, reading score indices should be positively associated with 
GGGI scores. It is difficult to specify any waking activity in modern culture which 
is independent, or at least uncorrelated, with the ability to read. It is possible to be 
more quantitatively precise. In doing so, all data employed in this section are in 
Tables 5.3 and 5.4. 

Consider first math. To remind the reader, letting $q_m$ denote q for math, the upper 
component latent math distribution was weighted $q_m$ for boys and $q_m^2$ for girls, and 
the component weights were the only features that marked their distributions as 
different. Recall $q_m - q_m^2 > 0$ for $0 < q_m < 1$. The difference $q_m - q_m^2$ is smallest 
for $q_m$ small, and this difference strictly increases in $q_m$ for $0 < q_m < 1/2$. This 
leads to the prediction that as the difference $q_m - q_m^2$ increases, signaling wider sex 
disparities in various domains for which math performance has relevance, the GGGI 
should modestly decrease. There are 30 countries in Tables 5.3 and 5.4 where pairs 
of these variables are available. Using notation that should be intuitively clear, the 
corresponding correlation is $r(\hat{q}_m - \hat{q}_m^2, \mathrm{GGGI}) = -0.313\ (0.165)$ with standard 
error in parentheses. The corresponding scatter plot appears in Fig. 6.5. 

Fig. 6.5 Scatter plot of GGGI against $\hat{q}_m - \hat{q}_m^2$ 

Fig. 6.6 Scatter plot of GGGI against $\hat{q}_r - \hat{q}_r^2$ for 29 countries. Japan with coordinates (0.158, 
0.650) is not shown. If Japan were excluded, $r = 0.456$ for 29 countries 

Consider reading: the reasoning is identical to that for math, with $q_r$ denoting 
the reading q. The girls' latent higher scoring reading component weight is $1 - q_r^2$, 
which is larger than the boys' component weight of $1 - q_r$. Thus, 
$(1 - q_r^2) - (1 - q_r) = q_r - q_r^2$, so precisely the same correlational structure of math and GGGI is of 
focus for reading except that a positive correlation is expected. Retaining the same 
30 countries as before, $r(\hat{q}_r - \hat{q}_r^2, \mathrm{GGGI}) = 0.441\ (0.147)$ with standard error 
in parentheses. The corresponding scatter plot appears in Fig. 6.6, while 
$r(\hat{q}_m - \hat{q}_m^2, \hat{q}_r - \hat{q}_r^2) = 0.112$. 
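
The standard errors quoted with the two correlations can be reproduced with the large-sample approximation $(1 - r^2)/\sqrt{n}$. Whether this is the formula the author actually used is an assumption; the sketch below only shows that it matches the reported figures to three decimals for n = 30.

```r
# Large-sample standard error of a correlation coefficient (assumed formula)
se_r <- function(r, n) (1 - r^2) / sqrt(n)

n_countries <- 30      # countries with both variables available (from the text)
r_math <- -0.313       # r for math with GGGI (from the text)
r_read <-  0.441       # r for reading with GGGI (from the text)

se_math <- round(se_r(r_math, n_countries), 3)
se_read <- round(se_r(r_read, n_countries), 3)
```

Other common approximations, such as $\sqrt{(1 - r^2)/(n - 2)}$, give somewhat larger values for these data.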

More broadly, these findings seem noteworthy. For one, they increase the 
plausibility of the wide variability of $\hat{q}$ for reading and math among countries, and 
for another, would seem to allow for Y to be viewed as having saliency within a 
wider context. The predictions are model-based, and the estimated proportion of 
the variance in the GGGI accounted for by the two PISA tests is 0.327 (please see 
Appendix A.5 for clarification). 

The above exposition views GGGI indices, in part at least, as a consequence 
of the same factors, under Y, that produce sex differences in reading and math 
test scores. Other perspectives view the causal suggestion as going in the opposite 
direction: countries with high GGGI, or other similar indices of country-wide 
social equality, are viewed as incubators for, if not causal agents of, reduced sex 
differences in math [19]. Furthermore, at the same time as the math gap for girls 
is presumably under reduction because of social forces, the same social forces are 
increasing the reading gap advantage for girls [151]. There is, apparently, no safe 
harbor for boys with their substantial mean reading gap. 

An article's headline summary is "Analysis of PISA results suggests that the 
gender gap in math scores disappears in countries with a more gender-equal culture 
[151, p. 1164, italics added]." Now replace the first and second italicized words, math 
and disappears, with, respectively, reading and increases. Doing so, one has an 
alternative description of their findings. 

The article's goal is to convince the reader that "... in countries with a higher GGI 
index (here GGGI), girls close the gender gap by becoming both better in math and 
reading, not by closing the math gap alone [151, p. 1165]." And they contend that 
"In more gender-equal countries, such as Norway and Sweden, the math gender gap 
disappears [151, p. 1164]." The authors repeatedly state that in more gender-equal 
countries the math gender-gap "disappears." 

While the implied causal linkage between countries with high GGGI and girls’ 
math achievement already appears suspect [152], there is another difficulty: the 
claim the gap disappears in high GGGI countries is misleading hyperbole. The 
gender gap does not disappear, nor are the Norway and Sweden math differences 
especially small. For both countries, $6 < \bar{x}_b - \bar{x}_g < 7$. For 12 of the 37 countries 
with both GGGI and PISA scores, the math differences are less than 7, and for two 
of these countries the differences are negative, as the data in Table 2.5 reveal. It 
is notable in this context that the U.S. math gap is 6.25, less than Sweden's gap 
of 6.53. This U.S. math gap has not been claimed, in the U.S.A. at least, to have 
"disappeared." 

Norway and Sweden were featured in the authors' chart [151, p. 1164] because 
in 2006, their GGGI values were, along with Finland's, the highest. Then, Norway's 
GGGI was 0.799 and Sweden's 0.813 [149]. In Table 5.4, the 2022 values show 
Norway with the third largest GGGI among 37 countries, 0.845, and Sweden with 
the fifth largest, 0.822. 

The authors apparently employ, as is done here, the 2003 PISA cycle data. 
Whether the source files are identical, however, cannot be determined. For 37 
countries with GGGI values and PISA scores, for math, 
$r(\bar{x}_b - \bar{x}_g, \mathrm{GGGI}) = -0.426$, so smaller math disparities are associated with higher GGGI. 

The problem with the authors' interpretation arises when reading is considered. 
If countries with equitable resource distribution are credited with reducing math 
disparities, would it not be plausible to expect reading disparities to be reduced 
as well? This is most certainly not the case. For PISA reading differences, 
$r(\bar{x}_g - \bar{x}_b, \mathrm{GGGI}) = 0.494$, signaling that as GGGI increases, so does the reading gap. The 
authors claim, as they must, and as implausible as it would seem, that a country's 
gender equality increases reading disparities. This does not imply of course that all 
of the $\bar{x}_g - \bar{x}_b > 0$ of the PISA reading differences are attributable to the presumed 
influence of a country's gender equality. For their two featured countries, Norway 
and Sweden, the reading scores hugely favor girls. 

For Norway, the difference in means is 49.2, the second largest among 41 
countries in Table 2.6; Sweden's difference is 36.75, well above the median of 33.34. 
These differences are many times the size of their math differences. These mean 
differences favoring girls in reading would seem to well exceed the sizes of socially 
or environmentally based influences for nearly any variable of interest. So, what is 
there about a country's gender equality that works dramatically differently on math 
to decrease disparities than it does on reading to increase disparities? Is it to be 
claimed that social equality forces only lift girls' scores? The issue is not addressed. 

The expected math mean test score difference between boys and girls under Y is 

$$E(Y_b - Y_g) = q(1 - q)(\mu_2 - \mu_1) > 0, \quad 0 < q < 1, \quad \mu_1 < \mu_2,$$

showing the math gap increases as $q \to 1/2$ and approaches zero as $q \to 0$. It 
is also largest for a fixed $\mu_2 - \mu_1 > 0$ when $q = 1/2$. (Replace the left side with 
$E(Y_g - Y_b)$ for reading.) 
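
A short numerical illustration of this expression may help. In the R sketch below the component separation `mu_d` is a hypothetical value chosen only for illustration (the estimated separations are not given in this section); the q values 0.026 and 0.306 are math estimates quoted elsewhere in this section.

```r
# Expected mean gap under the model: q (1 - q) (mu_2 - mu_1)
expected_gap <- function(q, mu_d) q * (1 - q) * mu_d

mu_d   <- 40                           # hypothetical mu_2 - mu_1
q_grid <- seq(0.01, 0.99, by = 0.01)
gaps   <- expected_gap(q_grid, mu_d)

# Value of q on the grid at which the expected gap is largest
q_at_max <- q_grid[which.max(gaps)]

# For the same separation, a small q implies a small expected gap
gap_small_q <- expected_gap(0.026, mu_d)
gap_large_q <- expected_gap(0.306, mu_d)
```

The grid maximum sits at q = 1/2, and for a fixed separation the expected gap at q = 0.026 is a small fraction of the gap at q = 0.306.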

The Y interpretation for reading and math for Sweden and Norway is quite 
different of course. First note that for both countries $\hat{V}$ for both reading and math 
are nearly identical with their V. $\hat{q}_m$ is small for both countries, and that is why 
the math mean sex differences are small. For Norway, $\hat{q}_m = 0.026$ and for Sweden 
$\hat{q}_m = 0.052$, the seventh and thirteenth smallest values among those in Table 5.3. For 
reading, for Sweden $\hat{q}_r = 0.541$, and for Norway $\hat{q}_r = 0.506$, the fifth and eighth 
largest values in Table 5.3. Because the $\hat{q}_r$ are large, the reading mean difference is 
large. The rough similarity in these two countries' q estimates for both reading and 
math would seem to be best understood, as suggested before, by their geography, a 
fact not mentioned in the article of focus. The two countries share a 1630 kilometer 
border and thus are thought to share similar gene pools [120]. 

Another application of GGGI has led to a spectacular failure. It was expected 
there would be a positive correlation between women's participation in college level 
math focused curricula among different countries and their GGGI index. Instead, a 
strikingly negative correlation, $r = -0.47$, appeared, a finding called a paradox by 
Stoet and Geary [153]. This is the second correlational finding they have labelled 
a paradox. The first "paradox" was discussed above in Sect. 6.8. 

The most common explanation for such findings continues to be gender 
stereotyping. Subsequently a gender stereotype variable, GMS, yielded 
r(GMS, GGGI) = 0.291 for OECD countries [154, p. 30, Table S4A], the most 
suitable comparison for the data available here. For a larger sample of countries, 
r(GMS, GGGI) = 0.434 [155, Table 1]. Thus, GMS accounts for about 9% or at 
most about 19% of the variance of GGGI. Using both math and reading PISA tests, 
Y accounts for nearly one-third of the GGGI variance, the 32.7% given above, well 
more than does GMS. 

While such correlations may be of interest, the spirit of a core global issue seems 
best captured by observations of Ceci and Williams [5, p. 168], discussed 
earlier: why is the male to female ratio of computer scientists in The Czech 
Republic more than six, while in the U.S.A. that ratio is about two? To the 
extent to which PISA math scores address the matter, and to briefly return to that 
discussion here, Y provides at least part of the answer. Figure 6.7 shows the 
estimated upper tail distribution functions under Y for both the U.S.A. and The 
Czech Republic assuming component normality. The upper tail sex differences are 
far more pronounced in The Czech Republic than in the U.S.A., largely because 
$\hat{q}_m = 0.306$ and 0.025 for The Czech Republic and U.S.A., respectively, as 
noted earlier. For both countries, the $\hat{V}$ match their V reported in Table 2.5. 

Fig. 6.7 Estimated distribution function upper tails for PISA math for boys and girls, The Czech 
Republic and the United States 
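
The role of q in the upper tails can be illustrated with a small R sketch. It assumes component normality, as in the text, with weight q on the upper component for boys and $q^2$ for girls. The component means, standard deviation, and cut score below are hypothetical and deliberately held the same for both countries so that only q differs; they are not the estimates behind Fig. 6.7.

```r
# Upper tail P(Y > y) of a two-component normal mixture with weight w on mu2
upper_tail <- function(y, w, mu1, mu2, sigma) {
  (1 - w) * pnorm(y, mean = mu1, sd = sigma, lower.tail = FALSE) +
    w * pnorm(y, mean = mu2, sd = sigma, lower.tail = FALSE)
}

# Ratio of boys' to girls' upper tail probabilities at cut score y
tail_ratio <- function(y, q, mu1, mu2, sigma) {
  upper_tail(y, q, mu1, mu2, sigma) / upper_tail(y, q^2, mu1, mu2, sigma)
}

# Hypothetical component parameters and cut score
mu1 <- 480; mu2 <- 600; sigma <- 80; y_cut <- 700

ratio_cz <- tail_ratio(y_cut, 0.306, mu1, mu2, sigma)  # q from the text, Czech Republic
ratio_us <- tail_ratio(y_cut, 0.025, mu1, mu2, sigma)  # q from the text, U.S.A.
```

With everything but q held fixed, the larger q produces the larger boy to girl ratio in the upper tail, which is the qualitative pattern described above.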

The GGGI indices employed in the above analyses are from 2022, while the 
PISA data are from the 2003 assessment cycle. Thus, nearly two decades separate 
the assessment times of the two variables of focus. This fact would seem to suggest 
that the findings are durable. 

Finally, please keep in mind that none of the references in this section which 
attempt to address the mean gender gap in math consider variance differences, 
the largest of the sex differences in both math and reading testing. To repeat yet 
again, any coherent explanation of sex differences must jointly address the variance 
differences as well as the mean differences. Only Y has, so far, achieved that status. 


6.12 The Misleading Language and Images of Sex 
Differences 


Both the language psychologists have used to refer to sex differences, and the 
graphical images drawn to portray such differences have been misleading. 
Consider language first. Words can shape beliefs and perceptions. Perhaps no 
two words have been used more often to characterize sex differences, certainly 
with respect to math and secondarily with respect to reading (as well as many other 
domains of focus) than “gender gap.” Search with the words “gender gap" in any 
browser and millions of hits are revealed. Gender gap appears in the titles of several 
books; it appears in three of fifteen chapter titles in a single book addressing math 
sex differences [79] and 128 times in a single book [5]. Gender gap is rarely 
explicitly defined, but it seems clear that it is taken to mean sample mean test score 
differences, at least where math and reading are of focus. The popularity of the two 
words notwithstanding, gender gap would seem to fail to satisfy a "real" definition, meaning 
to convey the "essential nature" or "essential attributes" of some entity [75, p. 93]. 
Certainly, in math testing, $\bar{x}_b - \bar{x}_g > 0$ typically. So considering gender gap as 
equivalent to boys' and girls' mean math test score difference is not wrong. 

However, once it is recognized how widely the inequality mdo holds and when it 
is realized how much larger $s_b^2 - s_g^2$ is than $\bar{x}_b - \bar{x}_g$ for math, any "essential nature" 
scalar characterizing sex differences in math and reading should be the variance 
difference. The median of the 41 ratios $T = (s_b^2 - s_g^2)/|\bar{x}_b - \bar{x}_g|$ for PISA math scores of 
Table 2.5 is 111. For PISA reading scores, Table 2.6, the median of the ratio is 46. 
Is it unreasonable to suggest that for many years the focus has been on the mean 
gender gap when it should be on the variance gender gap? 

The figures or graphs of distributions intended to portray sex differences in math 
are misleading. Invariably, portrayed are equal variance but shifted normal distri- 
butions. This is certainly an empirically wrong visual image, and it is conceptually 
misleading as well. 


6.13 Coda 


In the executive summary of Why So Few?, a book that concerns why there are 
few women in math and related fields, the authors write "While biological gender 
differences, yet to be well understood, may play a role, they clearly are not the 
whole story [156, p. xiv]." Nothing written in these chapters contradicts the spirit 
of this statement. What has been shown, however, is that virtually all of the reading 
and math test-based sex differences displayed by children in observational settings, 
especially those of S, can be explained by a simple model. 

View the foregoing effort as an attempt to advance understanding of the 
biological basis of the differences expressed in the above quote. In the process, 
the effort has hopefully illuminated the far larger mean difference favoring girls in 
reading that has been mostly ignored. It is literacy, not math, that is the far larger 
and more important skill for children to acquire. 

It would appear time to focus attention on the multitude of those sex differences 
which are most likely under some form of environmental or societal control and not 
mentioned above. 
Appendix A 
Arguments, Estimation, and R Code 


A.1 The Distribution of Y 


The unconditional distribution of $Y = B + N$. 

Set $P(B = \mu_k) = \pi_k$, $\pi_1 + \pi_2 = 1$, $0 < \pi_k < 1$, $k = 1, 2$. The cumulative 
distribution function of Y is 
$F(y) = P(Y \le y) = \sum_{k=1}^{2} P(Y \le y, B = \mu_k) = \sum_{k=1}^{2} P(Y \le y \mid B = \mu_k) P(B = \mu_k)$. 
Write $P(Y \le y \mid B = \mu_k) = F_k(y)$, and 
then $F(y) = P(Y \le y) = \pi_1 F_1(y) + \pi_2 F_2(y)$. If $f(y)$ is either a probability mass 
function or a density function, then $f(y) = \pi_1 f_1(y) + \pi_2 f_2(y)$. The distributions 
for reading and math, for boys and girls, are then easily specified. 
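
The mixture form of F(y) can be checked against a direct simulation of Y = B + N. The R sketch below assumes the second summand is normal with mean zero and variance $\sigma^2$ and is independent of B; the parameter values, seed, and simulation size are arbitrary illustrations.

```r
# Mixture CDF F(y) = pi_1 F_1(y) + pi_2 F_2(y) with normal components;
# w plays the role of pi_2, the weight on the upper component mu2
mixture_cdf <- function(y, w, mu1, mu2, sigma) {
  (1 - w) * pnorm(y, mean = mu1, sd = sigma) + w * pnorm(y, mean = mu2, sd = sigma)
}

set.seed(2)                       # arbitrary seed
n_sim <- 1e5                      # arbitrary simulation size
w <- 0.3; mu1 <- 0; mu2 <- 2; sigma <- 1   # hypothetical parameter values

B <- ifelse(runif(n_sim) < w, mu2, mu1)    # two-point latent component
N <- rnorm(n_sim, mean = 0, sd = sigma)    # normal component (assumed)
Y <- B + N

y0 <- c(-1, 0, 1, 2, 3)
empirical   <- sapply(y0, function(y) mean(Y <= y))
theoretical <- mixture_cdf(y0, w, mu1, mu2, sigma)
max_abs_diff <- max(abs(empirical - theoretical))
```

For math, setting `w` to q gives the boys' distribution and setting it to $q^2$ gives the girls' distribution.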


A.2 Inequality Arguments 


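The Y mean, variance, and inequality arguments for math: 
Conditions: $\mu_1 < \mu_2$, $0 < q < 1$, so $q^2 < q$ for all $q$. 

$$E(Y_b) = \mu_b = (1 - q)\mu_1 + q\mu_2 \tag{A.1}$$

$$E(Y_g) = \mu_g = (1 - q^2)\mu_1 + q^2\mu_2. \tag{A.2}$$

From (A.1) and (A.2), 

$$\mu_b > \mu_g \tag{A.3}$$

$$\mathrm{var}(Y_b) = \sigma_b^2 = (\mu_2 - \mu_1)^2 q(1 - q) + \sigma^2 \tag{A.4}$$

$$\mathrm{var}(Y_g) = \sigma_g^2 = (\mu_2 - \mu_1)^2 q^2(1 - q^2) + \sigma^2. \tag{A.5}$$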

© The Author(s) 2024 93 


H. Thomas, Sex Differences in Reading and Math Test Scores of Children, 
Monographs in the Psychology of Education, 
https://doi.org/10.1007/978-3-031-41272-1 




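From (A.4) and (A.5), 

$$\sigma_b^2 > \sigma_g^2 \iff 0 > q^2 + q - 1 \iff q < \frac{\sqrt{5} - 1}{2} \approx .618. \tag{A.6}$$

From (A.1), (A.2), (A.4), and (A.5) with $\mu_2 - \mu_1 = \mu_d > 0$: if 

$$L = \{\mu_d = \mu_2 - \mu_1 > 1 \ \&\ 0 < q < ([5 - 4/\mu_d]^{1/2} - 1)/2 < .618\} \tag{A.7}$$

and L in (A.7) holds, then 

$$\sigma_b^2 - \sigma_g^2 > \mu_b - \mu_g. \tag{A.8}$$

The argument for (A.8): 

$$\sigma_b^2 - \sigma_g^2 = q(1 - q)\mu_d^2[1 - q(1 + q)] \tag{A.9}$$

$$\mu_b - \mu_g = q(1 - q)\mu_d \tag{A.10}$$

$$\sigma_b^2 - \sigma_g^2 > \mu_b - \mu_g \iff \mu_d[1 - q(1 + q)] > 1 \tag{A.11}$$

$$\iff -\mu_d q^2 - \mu_d q + \mu_d - 1 > 0$$

$$\iff q < \frac{\sqrt{5 - 4/\mu_d} - 1}{2}. \tag{A.12}$$

Inspection of (A.12) reveals that for $q > 0$ it must be that $\mu_d > 1$; if $\mu_d = \infty$, 
then $q = (\sqrt{5} - 1)/2 = .618$. Thus, for (A.8) to hold, L must hold. 
A similar development follows for reading. 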
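
The mean and variance expressions (A.1), (A.2), (A.4), and (A.5) can be verified numerically. The R sketch below assumes normal components with common standard deviation $\sigma$ and uses hypothetical parameter values chosen so that condition L in (A.7) holds; it integrates the mixture density directly and compares the results with the closed forms.

```r
# Mean and variance of a two-component normal mixture by numerical integration;
# w is the weight on the upper component mu2
mixture_moments <- function(w, mu1, mu2, sigma) {
  pdf <- function(y) (1 - w) * dnorm(y, mu1, sigma) + w * dnorm(y, mu2, sigma)
  m1 <- integrate(function(y) y   * pdf(y), -Inf, Inf)$value
  m2 <- integrate(function(y) y^2 * pdf(y), -Inf, Inf)$value
  c(mean = m1, var = m2 - m1^2)
}

# Hypothetical values with mu_d > 1 and q below the bound in (A.7)
q <- 0.3; mu1 <- 0; mu2 <- 2; sigma <- 1
mu_d <- mu2 - mu1

boys_numeric  <- mixture_moments(q,   mu1, mu2, sigma)
girls_numeric <- mixture_moments(q^2, mu1, mu2, sigma)

# Closed forms (A.1), (A.4) for boys and (A.2), (A.5) for girls
boys_closed  <- c(mean = (1 - q)   * mu1 + q   * mu2,
                  var  = mu_d^2 * q   * (1 - q)   + sigma^2)
girls_closed <- c(mean = (1 - q^2) * mu1 + q^2 * mu2,
                  var  = mu_d^2 * q^2 * (1 - q^2) + sigma^2)

# Inequality (A.8): variance gap versus mean gap
var_gap  <- boys_closed[["var"]]  - girls_closed[["var"]]
mean_gap <- boys_closed[["mean"]] - girls_closed[["mean"]]
```

With these values the variance gap exceeds the mean gap, as (A.8) requires when L holds.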


A.3 Estimation Algorithm 


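The moment estimation algorithm for math, given mo: 
Step 1: Execute (A.13) and (A.14): 

$$d = \frac{(\bar{x}_b - \bar{x}_g)^2}{s_b^2 - s_g^2} = \frac{(\mu_b - \mu_g)^2}{\sigma_b^2 - \sigma_g^2} = \frac{q(1 - q)}{1 - q(1 + q)} \tag{A.13}$$

$$\hat{q} = \frac{1 + d - \sqrt{5d^2 - 2d + 1}}{2(1 - d)}. \tag{A.14}$$

The moment equations equate sample and model quantities: 

$$\bar{x}_b - \bar{x}_g = \mu_b - \mu_g = (\mu_2 - \mu_1)q(1 - q) \tag{A.15}$$

$$s_b^2 - s_g^2 = \sigma_b^2 - \sigma_g^2 = (\mu_2 - \mu_1)^2 q(1 - q)[1 - q(1 + q)] \tag{A.16}$$

$$\bar{x}_b = \mu_b = (1 - q)\mu_1 + q\mu_2 \tag{A.17}$$

$$\bar{x}_g = \mu_g = (1 - q^2)\mu_1 + q^2\mu_2 \tag{A.18}$$

$$s_b^2 = \sigma^2 + q(1 - q)(\mu_2 - \mu_1)^2 \tag{A.19}$$

$$s_g^2 = \sigma^2 + q^2(1 - q^2)(\mu_2 - \mu_1)^2. \tag{A.20}$$

Step 2: Estimate $\mu_d = \mu_2 - \mu_1$ using (A.14) and (A.16): 

$$\hat{\mu}_d = \widehat{(\mu_2 - \mu_1)} = \sqrt{\frac{s_b^2 - s_g^2}{\hat{q}(1 - \hat{q})[1 - \hat{q}(1 + \hat{q})]}}. \tag{A.21}$$

Using (A.14) and (A.15), 

$$\hat{\mu}_d = \widehat{(\mu_2 - \mu_1)} = \frac{\bar{x}_b - \bar{x}_g}{\hat{q}(1 - \hat{q})}. \tag{A.22}$$

(A.22) is used in the algorithm. 
Step 3: Estimate $\sigma^2$ using (A.19) and (A.20): 

$$\hat{\sigma}^2 = s_b^2 - \hat{\mu}_d^2\,\hat{q}(1 - \hat{q}) \tag{A.23}$$

and 

$$\hat{\sigma}^2 = s_g^2 - \hat{\mu}_d^2\,\hat{q}^2(1 - \hat{q}^2). \tag{A.24}$$

The two $\hat{\sigma}^2$ are averaged and set equal to $\hat{\sigma}^2$ if they are positive. Estimates can be 
negative, so-called Heywood cases, in which case estimation fails. 

Three parameters are estimated above, $\hat{\mu}_d$, $\hat{q}$, and $\hat{\sigma}^2$, so the estimation is 
complete save for translations of $f_b(y)$ and $f_g(y)$ on the real line. However, $\mu_k$ 
can be estimated to uniquely fix their locations. 
Step 4: Estimate $\mu_k$: 

$$\hat{\mu}_1 = \frac{\bar{x}_g - \hat{q}\bar{x}_b}{1 - \hat{q}} \tag{A.25}$$

$$\hat{\mu}_2 = \frac{(1 + \hat{q})\bar{x}_b - \bar{x}_g}{\hat{q}}. \tag{A.26}$$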
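The following provides the argument for (A.25) and (A.26). 

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}. \tag{A.27}$$

Rewriting (A.17) and (A.18), 

$$\begin{pmatrix} \bar{x}_b \\ \bar{x}_g \end{pmatrix} = \begin{pmatrix} 1 - q & q \\ 1 - q^2 & q^2 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}. \tag{A.28}$$

Using (A.27), 

$$\begin{pmatrix} 1 - \hat{q} & \hat{q} \\ 1 - \hat{q}^2 & \hat{q}^2 \end{pmatrix}^{-1} = \frac{1}{\hat{q}(\hat{q} - 1)}\begin{pmatrix} \hat{q}^2 & -\hat{q} \\ \hat{q}^2 - 1 & 1 - \hat{q} \end{pmatrix}. \tag{A.29}$$

Using (A.28) and (A.29), 

$$\begin{pmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \end{pmatrix} = \frac{1}{\hat{q}(\hat{q} - 1)}\begin{pmatrix} \hat{q}^2 & -\hat{q} \\ \hat{q}^2 - 1 & 1 - \hat{q} \end{pmatrix}\begin{pmatrix} \bar{x}_b \\ \bar{x}_g \end{pmatrix}, \tag{A.30}$$

giving (A.25) and (A.26). Note (A.26) minus (A.25) equals (A.22). The changes 
required for estimates of reading V should be clear. 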
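
One way to build confidence in Steps 1 to 4 is a round trip: compute the exact model moments from known parameters, feed them to the estimator, and confirm the parameters come back. The R sketch below does this for math. The function name `fit_moments` and the parameter values are hypothetical, and the function takes variances directly rather than standard deviations.

```r
# Moment estimator for math following (A.13), (A.14), (A.22)-(A.26)
fit_moments <- function(xb, xg, sb2, sg2) {
  d    <- (xb - xg)^2 / (sb2 - sg2)                               # (A.13)
  q    <- ((1 + d) - sqrt(5 * d^2 - 2 * d + 1)) / (2 * (1 - d))   # (A.14)
  mu_d <- (xb - xg) / (q * (1 - q))                               # (A.22)
  s2_b <- sb2 - mu_d^2 * q * (1 - q)                              # (A.23)
  s2_g <- sg2 - mu_d^2 * q^2 * (1 - q^2)                          # (A.24)
  if (any(c(s2_b, s2_g) <= 0)) warning("Heywood case: estimation fails")
  mu1  <- (xg - q * xb) / (1 - q)                                 # (A.25)
  mu2  <- ((1 + q) * xb - xg) / q                                 # (A.26)
  c(q = q, mu_d = mu_d, mu1 = mu1, mu2 = mu2, sigma2 = mean(c(s2_b, s2_g)))
}

# Hypothetical "true" parameters
q <- 0.3; mu1 <- 480; mu2 <- 520; sigma2 <- 8000
mu_d <- mu2 - mu1

# Exact model moments from (A.17)-(A.20)
xb  <- (1 - q)   * mu1 + q   * mu2
xg  <- (1 - q^2) * mu1 + q^2 * mu2
sb2 <- sigma2 + q   * (1 - q)   * mu_d^2
sg2 <- sigma2 + q^2 * (1 - q^2) * mu_d^2

est <- fit_moments(xb, xg, sb2, sg2)
```

Note that (A.14) is undefined when d = 1, and that with real data the two $\hat{\sigma}^2$ values generally differ, which is why they are averaged.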


A.4 R Function mathgap 


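Math estimation algorithm in R. Required is a vector v (lower case) with four or six 
elements satisfying mo. 

mathgap <- function() {
  # v is not passed as an argument; it is looked up in the enclosing
  # (normally global) environment:
  #   v[1], v[2]: sample means for boys and girls
  #   v[3], v[4]: sample standard deviations for boys and girls
  #   v[5], v[6]: sample sizes for boys and girls (six-element case only)
  mb <- v[1]; mg <- v[2]; vb <- v[3]^2; vg <- v[4]^2
  # output labels
  nv <- c("q", "q^2", "md", "m1", "m2", "var(N)", "vxb", "vxg",
          "h^2b", "h^2g", "tvb", "tvg")
  if (length(v) == 4) {
    d <- ((mb - mg)^2) / (vb - vg)                                 # (A.13)
    estq <- ((1 + d) - sqrt(5 * d^2 - 2 * d + 1)) / (2 * (1 - d))  # (A.14)
    m1 <- (v[2] - estq * v[1]) / (1 - estq)                        # (A.25)
    m2 <- ((1 + estq) * v[1] - v[2]) / estq                        # (A.26)
    md <- m2 - m1                                                  # equals (A.22)
    # variance contributed by the two-point component, boys and girls
    vx1b <- md^2 * estq * (1 - estq); vx1g <- md^2 * (estq^2) * (1 - estq^2)
    # the two estimates of sigma^2, (A.23) and (A.24)
    vs <- c(vb - md^2 * estq * (1 - estq), vg - md^2 * estq^2 * (1 - estq^2))
    if (sum(vs > 0) == 2) vstuff <- mean(vs) else
      {print("Heywood"); vstuff <- max(vs)}
    # results are also stored, unrounded, in the global variable gv
    gv <<- pv <- c(estq, estq^2, md, m1, m2, vstuff, vx1b, vx1g,
                   vx1b / (vx1b + vstuff), vx1g / (vx1g + vstuff),
                   vstuff + vx1b, vstuff + vx1g)
    names(pv) <- nv
    pv <- round(pv, 3)
    return(pv)
  }
  if (length(v) == 6) {
    nb <- v[5]; ng <- v[6]
    # d with the sampling variance of the mean difference removed;
    # fall back to the uncorrected d if the corrected value is negative
    d <- ((mb - mg)^2 - (vb / nb + vg / ng)) / (vb - vg)
    if (d < 0) d <- ((mb - mg)^2) / (vb - vg)
    estq <- ((1 + d) - sqrt(5 * d^2 - 2 * d + 1)) / (2 * (1 - d))
    m1 <- (v[2] - estq * v[1]) / (1 - estq)
    m2 <- ((1 + estq) * v[1] - v[2]) / estq
    md <- m2 - m1
    vx1b <- md^2 * estq * (1 - estq); vx1g <- md^2 * (estq^2) * (1 - estq^2)
    vs <- c(vb - md^2 * estq * (1 - estq), vg - md^2 * estq^2 * (1 - estq^2))
    # sample-size weighted average of the two sigma^2 estimates
    if (sum(vs > 0) == 2) {vstuff <- (vs[1] * nb + vs[2] * ng) / (nb + ng)} else
      {print("Heywood"); vstuff <- max(vs)}
    gv <<- pv <- c(estq, estq^2, md, m1, m2, vstuff, vx1b, vx1g,
                   vx1b / (vx1b + vstuff), vx1g / (vx1g + vstuff),
                   vstuff + vx1b, vstuff + vx1g)
    names(pv) <- nv
    pv <- round(pv, 3)
    return(pv)
  }
}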


A.5 Conditional r Variance 


The proportion of variance of x given y and z. 

Suppose there are three variables, x, y, and z, with $r_{xy}$ denoting the sample 
correlation coefficient between x and y, and similarly $r_{xz}$ and $r_{yz}$. Desired is the 
estimated proportion of the variance of x accounted for by y and z. The expression 
is 

$$\frac{1}{1 - r_{yz}^2}\left(r_{xy}^2 + r_{xz}^2 - 2 r_{xy} r_{xz} r_{yz}\right).$$

The expression is distribution free, but the linearity of the corresponding regressions 
is assumed. For application here, set x = GGGI, $y = \hat{q}_m - \hat{q}_m^2$, and $z = \hat{q}_r - \hat{q}_r^2$. 
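
As a check on the arithmetic, the expression can be evaluated with the three correlations reported in Sect. 6.11. The R sketch below uses a hypothetical function name; the inputs are the published, rounded correlations, so agreement is expected only to about three decimals.

```r
# Proportion of variance of x accounted for by y and z, from pairwise correlations
r2_two_predictors <- function(r_xy, r_xz, r_yz) {
  (r_xy^2 + r_xz^2 - 2 * r_xy * r_xz * r_yz) / (1 - r_yz^2)
}

r_xy <- -0.313   # GGGI with the math q difference (Sect. 6.11)
r_xz <-  0.441   # GGGI with the reading q difference (Sect. 6.11)
r_yz <-  0.112   # math q difference with reading q difference (Sect. 6.11)

r2_gggi <- r2_two_predictors(r_xy, r_xz, r_yz)
r2_rounded <- round(r2_gggi, 3)
```

The rounded result agrees with the 0.327 quoted in Sect. 6.11.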